Understanding Diffusion Models: A Unified Perspective

Calvin Luo

Google Research, Brain Team

calvinluo@google.com

August 26, 2022

Contents

Introduction: Generative Models

Background: ELBO, VAE, and Hierarchical VAE

Evidence Lower Bound

Variational Autoencoders

Hierarchical Variational Autoencoders

Variational Diffusion Models

Learning Diffusion Noise Parameters

Three Equivalent Interpretations

Score-based Generative Models

Guidance

Classifier Guidance

Classifier-Free Guidance

Closing

Introduction: Generative Models

Given observed samples $x$ from a distribution of interest, the goal of a generative model is to learn to model its true data distribution $p(x)$. Once learned, we can generate new samples from our approximate model at will. Furthermore, under some formulations, we are able to use the learned model to evaluate the likelihood of observed or sampled data as well.

There are several well-known directions in current literature, which we will only introduce briefly at a high level. Generative Adversarial Networks (GANs) model the sampling procedure of a complex distribution, which is learned in an adversarial manner. Another class of generative models, termed "likelihood-based", seeks to learn a model that assigns a high likelihood to the observed data samples. This includes autoregressive models, normalizing flows, and Variational Autoencoders (VAEs). Another similar approach is energy-based modeling, in which a distribution is learned as an arbitrarily flexible energy function that is then normalized.

Score-based generative models are highly related; instead of learning to model the energy function itself, they learn the score of the energy-based model as a neural network. In this work we explore and review diffusion models, which as we will demonstrate, have both likelihood-based and score-based interpretations. We showcase the math behind such models in excruciating detail, with the aim that anyone can follow along and understand what diffusion models are and how they work.

Background: ELBO, VAE, and Hierarchical VAE

For many modalities, we can think of the data we observe as represented or generated by an associated unseen latent variable, which we can denote by random variable $z$. The best intuition for expressing this idea is through Plato's Allegory of the Cave. In the allegory, a group of people are chained inside a cave their entire life and can only see the two-dimensional shadows projected onto a wall in front of them, which are generated by unseen three-dimensional objects passed before a fire. To such people, everything they observe is actually determined by higher-dimensional abstract concepts that they can never behold.

Analogously, the objects that we encounter in the actual world may also be generated as a function of some higher-level representations; for example, such representations may encapsulate abstract properties such as color, size, shape, and more. Then, what we observe can be interpreted as a three-dimensional projection or instantiation of such abstract concepts, just as what the cave people observe is actually a two-dimensional projection of three-dimensional objects. Whereas the cave people can never see (or even fully comprehend) the hidden objects, they can still reason and draw inferences about them; in a similar way, we can approximate latent representations that describe the data we observe.

Whereas Plato's Allegory illustrates the idea behind latent variables as potentially unobservable representations that determine observations, a caveat of this analogy is that in generative modeling, we generally seek to learn lower-dimensional latent representations rather than higher-dimensional ones. This is because trying to learn a representation of higher dimension than the observation is a fruitless endeavor without strong priors. On the other hand, learning lower-dimensional latents can also be seen as a form of compression, and can potentially uncover semantically meaningful structure describing observations.

Evidence Lower Bound

Mathematically, we can imagine the latent variables and the data we observe as modeled by a joint distribution $p(x, z)$. Recall one approach of generative modeling, termed "likelihood-based", is to learn a model to maximize the likelihood $p(x)$ of all observed $x$. There are two ways we can manipulate this joint distribution to recover the likelihood of purely our observed data $p(x)$; we can explicitly marginalize out the latent variable $z$:

$$p(x) = \int p(x, z) \, dz \tag{1}$$

or, we could also appeal to the chain rule of probability:

$$p(x) = \frac{p(x, z)}{p(z|x)} \tag{2}$$

Directly computing and maximizing the likelihood $p(x)$ is difficult because it either involves integrating out all latent variables $z$ in Equation 1, which is intractable for complex models, or it involves having access to a ground truth latent encoder $p(z|x)$ in Equation 2. However, using these two equations, we can derive a term called the Evidence Lower Bound (ELBO), which as its name suggests, is a lower bound of the evidence. The evidence is quantified in this case as the log likelihood of the observed data. Then, maximizing the ELBO becomes a proxy objective with which to optimize a latent variable model; in the best case, when the ELBO is powerfully parameterized and perfectly optimized, it becomes exactly equivalent to the evidence. Formally, the equation of the ELBO is:

$$\mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(x, z)}{q_\phi(z|x)}\right] \tag{3}$$

To make the relationship with the evidence explicit, we can mathematically write:

$$\log p(x) \geq \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(x, z)}{q_\phi(z|x)}\right] \tag{4}$$

Here, $q_\phi(z|x)$ is a flexible approximate variational distribution with parameters $\phi$ that we seek to optimize. Intuitively, it can be thought of as a parameterizable model that is learned to estimate the true distribution over latent variables for given observations $x$; in other words, it seeks to approximate the true posterior $p(z|x)$. As we will see when exploring the Variational Autoencoder, as we increase the lower bound by tuning the parameters $\phi$ to maximize the ELBO, we gain access to components that can be used to model the true data distribution and sample from it, thus learning a generative model. For now, let us try to dive deeper into why the ELBO is an objective we would like to maximize.

Let us begin by deriving the ELBO, using Equation 1:

$$\log p(x) = \log \int p(x, z) \, dz \qquad \text{(Apply Equation 1)} \tag{5}$$

$$= \log \int \frac{p(x, z) \, q_\phi(z|x)}{q_\phi(z|x)} \, dz \qquad \text{(Multiply by } 1 = \tfrac{q_\phi(z|x)}{q_\phi(z|x)}\text{)} \tag{6}$$

$$= \log \mathbb{E}_{q_\phi(z|x)}\left[\frac{p(x, z)}{q_\phi(z|x)}\right] \qquad \text{(Definition of Expectation)} \tag{7}$$

$$\geq \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(x, z)}{q_\phi(z|x)}\right] \qquad \text{(Apply Jensen's Inequality)} \tag{8}$$

In this derivation, we directly arrive at our lower bound by applying Jensen's Inequality. However, this does not supply us with much useful information about what is actually going on underneath the hood; crucially, this proof gives no intuition on exactly why the ELBO is actually a lower bound of the evidence, as Jensen's Inequality handwaves it away. Furthermore, simply knowing that the ELBO is truly a lower bound of the evidence does not really tell us why we want to maximize it as an objective. To better understand the relationship between the evidence and the ELBO, let us perform another derivation, this time using Equation 2:

$$\log p(x) = \log p(x) \int q_\phi(z|x) \, dz \qquad \text{(Multiply by } 1 = \int q_\phi(z|x) \, dz\text{)} \tag{9}$$

$$= \int q_\phi(z|x) \log p(x) \, dz \qquad \text{(Bring evidence into integral)} \tag{10}$$

$$= \mathbb{E}_{q_\phi(z|x)}\left[\log p(x)\right] \qquad \text{(Definition of Expectation)} \tag{11}$$

$$= \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(x, z)}{p(z|x)}\right] \qquad \text{(Apply Equation 2)} \tag{12}$$

$$= \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(x, z) \, q_\phi(z|x)}{p(z|x) \, q_\phi(z|x)}\right] \qquad \text{(Multiply by } 1 = \tfrac{q_\phi(z|x)}{q_\phi(z|x)}\text{)} \tag{13}$$

$$= \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(x, z)}{q_\phi(z|x)}\right] + \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p(z|x)}\right] \qquad \text{(Split the Expectation)} \tag{14}$$

$$= \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(x, z)}{q_\phi(z|x)}\right] + D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z|x)) \qquad \text{(Definition of KL Divergence)} \tag{15}$$

$$\geq \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(x, z)}{q_\phi(z|x)}\right] \qquad \text{(KL Divergence always} \geq 0\text{)} \tag{16}$$

From this derivation, we clearly observe from Equation 15 that the evidence is equal to the ELBO plus the KL Divergence between the approximate posterior $q_\phi(z|x)$ and the true posterior $p(z|x)$. In fact, it was this KL Divergence term that was magically removed by Jensen's Inequality in Equation 8 of the first derivation. Understanding this term is the key to understanding not only the relationship between the ELBO and the evidence, but also the reason why optimizing the ELBO is an appropriate objective at all.

Firstly, we now know why the ELBO is indeed a lower bound: the difference between the evidence and the ELBO is a strictly non-negative KL term, thus the value of the ELBO can never exceed the evidence. Secondly, we explore why we seek to maximize the ELBO. Having introduced latent variables $z$ that we would like to model, our goal is to learn this underlying latent structure that describes our observed data. In other words, we want to optimize the parameters of our variational posterior $q_\phi(z|x)$ to exactly match the true posterior distribution $p(z|x)$, which is achieved by minimizing their KL Divergence (ideally to zero). Unfortunately, it is intractable to minimize this KL Divergence term directly, as we do not have access to the ground truth $p(z|x)$ distribution. However, notice that on the left hand side of Equation 15, the likelihood of our data (and therefore our evidence term $\log p(x)$) is always a constant with respect to $\phi$, as it is computed by marginalizing out all latents $z$ from the joint distribution $p(x, z)$ and does not depend on $\phi$ whatsoever. Since the ELBO and KL Divergence terms sum up to a constant, any maximization of the ELBO term with respect to $\phi$ necessarily invokes an equal minimization of the KL Divergence term. Thus, the ELBO can be maximized as a proxy for learning how to perfectly model the true latent posterior distribution; the more we optimize the ELBO, the closer our approximate posterior gets to the true posterior. Additionally, once trained, the ELBO can be used to estimate the likelihood of observed or generated data as well, since it is learned to approximate the model evidence $\log p(x)$.
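The evidence–ELBO–KL identity of Equation 15 can be checked numerically. The sketch below uses a tiny made-up discrete model (the probability tables are illustrative assumptions, not from the text) and verifies both that $\log p(x) = \text{ELBO} + D_{\mathrm{KL}}$ and that the ELBO never exceeds the evidence:

```python
import math

# Tiny discrete latent-variable model: z in {0, 1}, x in {0, 1}.
# All probability tables are made-up, purely for illustration.
p_z = {0: 0.6, 1: 0.4}                          # prior p(z)
p_x_given_z = {0: {0: 0.9, 1: 0.1},             # likelihood p(x|z)
               1: {0: 0.2, 1: 0.8}}

x = 1  # a single "observation"

# Evidence: p(x) = sum_z p(x, z)
p_x = sum(p_z[z] * p_x_given_z[z][x] for z in p_z)
log_px = math.log(p_x)

# True posterior p(z|x) = p(x, z) / p(x)
posterior = {z: p_z[z] * p_x_given_z[z][x] / p_x for z in p_z}

# An arbitrary (imperfect) variational distribution q(z|x)
q = {0: 0.5, 1: 0.5}

# ELBO = E_q[log p(x, z) / q(z|x)]  (Equation 3)
elbo = sum(q[z] * math.log(p_z[z] * p_x_given_z[z][x] / q[z]) for z in q)

# D_KL(q(z|x) || p(z|x))
kl = sum(q[z] * math.log(q[z] / posterior[z]) for z in q)

# Equation 15: evidence = ELBO + KL; since KL >= 0, ELBO <= evidence.
assert abs(log_px - (elbo + kl)) < 1e-12
assert elbo <= log_px
```

Making $q$ exactly equal to the true posterior drives the KL term to zero, at which point the ELBO equals the evidence, mirroring the argument above.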

Figure 1: A Variational Autoencoder graphically represented. Here, encoder $q(z|x)$ defines a distribution over latent variables $z$ for observations $x$, and $p(x|z)$ decodes latent variables into observations.

Variational Autoencoders

In the default formulation of the Variational Autoencoder (VAE) [1], we directly maximize the ELBO. This approach is variational, because we optimize for the best $q_\phi(z|x)$ amongst a family of potential posterior distributions parameterized by $\phi$. It is called an autoencoder because it is reminiscent of a traditional autoencoder model, where input data is trained to predict itself after undergoing an intermediate bottlenecking representation step. To make this connection explicit, let us dissect the ELBO term further:

$$\mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(x, z)}{q_\phi(z|x)}\right] = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x|z) \, p(z)}{q_\phi(z|x)}\right] \tag{17}$$

$$= \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(z)}{q_\phi(z|x)}\right] \qquad \text{(Split the Expectation)} \tag{18}$$

$$= \underbrace{\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]}_{\text{reconstruction term}} - \underbrace{D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z))}_{\text{prior matching term}} \qquad \text{(Definition of KL Divergence)} \tag{19}$$

In this case, we learn an intermediate bottlenecking distribution $q_\phi(z|x)$ that can be treated as an encoder; it transforms inputs into a distribution over possible latents. Simultaneously, we learn a deterministic function $p_\theta(x|z)$ to convert a given latent vector $z$ into an observation $x$, which can be interpreted as a decoder.

The two terms in Equation 19 each have intuitive descriptions: the first term measures the reconstruction likelihood of the decoder from our variational distribution; this ensures that the learned distribution is modeling effective latents that the original data can be regenerated from. The second term measures how similar the learned variational distribution is to a prior belief held over latent variables. Minimizing this term encourages the encoder to actually learn a distribution rather than collapse into a Dirac delta function. Maximizing the ELBO is thus equivalent to maximizing its first term and minimizing its second term. A defining feature of the VAE is how the ELBO is optimized jointly over parameters $\phi$ and $\theta$. The encoder of the VAE is commonly chosen to model a multivariate Gaussian with diagonal covariance, and the prior is often selected to be a standard multivariate Gaussian:

$$q_\phi(z|x) = \mathcal{N}(z; \mu_\phi(x), \sigma_\phi^2(x)\mathbf{I}) \tag{20}$$

$$p(z) = \mathcal{N}(z; \mathbf{0}, \mathbf{I}) \tag{21}$$

Then, the KL divergence term of the ELBO can be computed analytically, and the reconstruction term can be approximated using a Monte Carlo estimate. Our objective can then be rewritten as:

$$\underset{\phi, \theta}{\arg\max}\; \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z)) \approx \underset{\phi, \theta}{\arg\max} \sum_{l=1}^{L} \log p_\theta(x|z^{(l)}) - D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z)) \tag{22}$$

where latents $\{z^{(l)}\}_{l=1}^{L}$ are sampled from $q_\phi(z|x)$, for every observation $x$ in the dataset. However, a problem arises in this default setup: each $z^{(l)}$ that our loss is computed on is generated by a stochastic sampling procedure, which is generally non-differentiable. Fortunately, this can be addressed via the reparameterization trick when $q_\phi(z|x)$ is designed to model certain distributions, including the multivariate Gaussian.
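As a sanity check on the prior matching term in Equation 22, the KL divergence between a diagonal Gaussian encoder and the standard Gaussian prior has a well-known closed form. The sketch below (with arbitrary illustrative values for $\mu$ and $\sigma$) compares the closed form against a Monte Carlo estimate:

```python
import math
import random

random.seed(0)
mu = [0.5, -1.0, 2.0]      # illustrative encoder means
sigma = [1.5, 0.7, 0.3]    # illustrative encoder standard deviations

# Closed form: D_KL(N(mu, diag(sigma^2)) || N(0, I))
#   = 0.5 * sum_i (sigma_i^2 + mu_i^2 - 1 - log sigma_i^2)
kl_analytic = 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s)
                        for m, s in zip(mu, sigma))

def log_normal(z, m, s):
    """Log density of a diagonal Gaussian N(m, diag(s^2)) at z."""
    return sum(-0.5 * math.log(2 * math.pi * si * si)
               - (zi - mi) ** 2 / (2 * si * si)
               for zi, mi, si in zip(z, m, s))

# Monte Carlo estimate: E_q[log q(z) - log p(z)], with z = mu + sigma * eps
n = 100_000
total = 0.0
for _ in range(n):
    z = [m + s * random.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
    total += log_normal(z, mu, sigma) - log_normal(z, [0.0] * 3, [1.0] * 3)
kl_mc = total / n

assert abs(kl_mc - kl_analytic) < 0.05
```

Because the KL term is available analytically, only the reconstruction term needs a Monte Carlo estimate in practice, as noted above.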

The reparameterization trick rewrites a random variable as a deterministic function of a noise variable; this allows for the optimization of the non-stochastic terms through gradient descent. For example, samples from a normal distribution $x \sim \mathcal{N}(x; \mu, \sigma^2)$ with arbitrary mean $\mu$ and variance $\sigma^2$ can be rewritten as:

$$x = \mu + \sigma\epsilon \quad \text{with } \epsilon \sim \mathcal{N}(\epsilon; 0, \mathbf{I})$$

In other words, arbitrary Gaussian distributions can be interpreted as standard Gaussians (of which $\epsilon$ is a sample) that have their mean shifted from zero to the target mean $\mu$ by addition, and their variance stretched by the target variance $\sigma^2$. Therefore, by the reparameterization trick, sampling from an arbitrary Gaussian distribution can be performed by sampling from a standard Gaussian, scaling the result by the target standard deviation, and shifting it by the target mean.
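A minimal numerical sketch of this shift-and-scale view, with illustrative values of $\mu$ and $\sigma$, confirming that reparameterized samples match the target Gaussian's statistics:

```python
import random

random.seed(1)
mu, sigma = 3.0, 0.5   # illustrative target mean and standard deviation

# Reparameterization: x = mu + sigma * eps, with eps ~ N(0, 1)
samples = [mu + sigma * random.gauss(0.0, 1.0) for _ in range(100_000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

# The reparameterized samples match the target N(mu, sigma^2) statistics.
assert abs(mean - mu) < 0.02
assert abs(var - sigma ** 2) < 0.02
```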

In a VAE, each $z$ is thus computed as a deterministic function of input $x$ and auxiliary noise variable $\epsilon$:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon \quad \text{with } \epsilon \sim \mathcal{N}(\epsilon; \mathbf{0}, \mathbf{I})$$

where $\odot$ represents an element-wise product. Under this reparameterized version of $z$, gradients can then be computed with respect to $\phi$ as desired, to optimize $\mu_\phi$ and $\sigma_\phi$. The VAE therefore utilizes the reparameterization trick and Monte Carlo estimates to optimize the ELBO jointly over $\phi$ and $\theta$.
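To illustrate why reparameterization enables gradient-based training, the sketch below estimates the pathwise gradient of a toy objective $\mathbb{E}[x^2]$ with respect to $\mu$ and compares it against the analytic value $2\mu$. The objective is a stand-in for the ELBO, chosen only because its gradient is known exactly:

```python
import random

random.seed(2)
mu, sigma = 1.5, 0.8   # illustrative parameters
n = 100_000

# Toy objective: E_{x ~ N(mu, sigma^2)}[x^2] = mu^2 + sigma^2.
# With x = mu + sigma * eps, each sample yields an unbiased pathwise
# gradient: d(x^2)/d(mu) = 2 * x * d(x)/d(mu) = 2 * x.
grad_est = sum(2.0 * (mu + sigma * random.gauss(0.0, 1.0))
               for _ in range(n)) / n

assert abs(grad_est - 2.0 * mu) < 0.05   # analytic gradient is 2 * mu
```

Without reparameterization, the sampling step would block the gradient path from the loss back to $\mu$; with it, each sample carries a usable gradient signal.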

After training a VAE, generating new data can be performed by sampling directly from the latent space $p(z)$ and then running it through the decoder. Variational Autoencoders are particularly interesting when the dimensionality of $z$ is less than that of input $x$, as we might then be learning compact, useful representations. Furthermore, when a semantically meaningful latent space is learned, latent vectors can be edited before being passed to the decoder to more precisely control the data generated.

Hierarchical Variational Autoencoders

A Hierarchical Variational Autoencoder (HVAE) [2,3] is a generalization of a VAE that extends to multiple hierarchies over latent variables. Under this formulation, latent variables themselves are interpreted as generated from other higher-level, more abstract latents. Intuitively, just as we treat our three-dimensional observed objects as generated from a higher-level abstract latent, the people in Plato's cave treat three-dimensional objects as latents that generate their two-dimensional observations. Therefore, from the perspective of Plato's cave dwellers, their observations can be treated as modeled by a latent hierarchy of depth two (or more).

Whereas in the general HVAE with $T$ hierarchical levels, each latent is allowed to condition on all previous latents, in this work we focus on a special case which we call a Markovian HVAE (MHVAE). In a MHVAE, the generative process is a Markov chain; that is, each transition down the hierarchy is Markovian, where

Figure 2: A Markovian Hierarchical Variational Autoencoder with $T$ hierarchical latents. The generative process is modeled as a Markov chain, where each latent $z_t$ is generated only from the previous latent $z_{t+1}$.

decoding each latent $z_t$ only conditions on the previous latent $z_{t+1}$. Intuitively, and visually, this can be seen as simply stacking VAEs on top of each other, as depicted in Figure 2; another appropriate term describing this model is a Recursive VAE. Mathematically, we represent the joint distribution and the posterior of a Markovian HVAE as:

$$p(x, z_{1:T}) = p(z_T) \, p_\theta(x|z_1) \prod_{t=2}^{T} p_\theta(z_{t-1}|z_t) \tag{23}$$

$$q_\phi(z_{1:T}|x) = q_\phi(z_1|x) \prod_{t=2}^{T} q_\phi(z_t|z_{t-1}) \tag{24}$$

Then, we can easily extend the ELBO to be:

$$\log p(x) = \log \int p(x, z_{1:T}) \, dz_{1:T} \qquad \text{(Apply Equation 1)} \tag{25}$$

$$= \log \int \frac{p(x, z_{1:T}) \, q_\phi(z_{1:T}|x)}{q_\phi(z_{1:T}|x)} \, dz_{1:T} \qquad \text{(Multiply by } 1 = \tfrac{q_\phi(z_{1:T}|x)}{q_\phi(z_{1:T}|x)}\text{)} \tag{26}$$

$$= \log \mathbb{E}_{q_\phi(z_{1:T}|x)}\left[\frac{p(x, z_{1:T})}{q_\phi(z_{1:T}|x)}\right] \qquad \text{(Definition of Expectation)} \tag{27}$$

$$\geq \mathbb{E}_{q_\phi(z_{1:T}|x)}\left[\log \frac{p(x, z_{1:T})}{q_\phi(z_{1:T}|x)}\right] \qquad \text{(Apply Jensen's Inequality)} \tag{28}$$

We can then plug our joint distribution (Equation 23) and posterior (Equation 24) into Equation 28 to produce an alternate form:

$$\mathbb{E}_{q_\phi(z_{1:T}|x)}\left[\log \frac{p(x, z_{1:T})}{q_\phi(z_{1:T}|x)}\right] = \mathbb{E}_{q_\phi(z_{1:T}|x)}\left[\log \frac{p(z_T) \, p_\theta(x|z_1) \prod_{t=2}^{T} p_\theta(z_{t-1}|z_t)}{q_\phi(z_1|x) \prod_{t=2}^{T} q_\phi(z_t|z_{t-1})}\right] \tag{29}$$

As we will show below, when we investigate Variational Diffusion Models, this objective can be further decomposed into interpretable components.

Variational Diffusion Models

The easiest way to think of a Variational Diffusion Model (VDM) [4, 5, 6] is simply as a Markovian Hierarchical Variational Autoencoder with three key restrictions:

  1. The latent dimension is exactly equal to the data dimension.
  1. The structure of the latent encoder at each timestep is not learned; it is pre-defined as a linear Gaussian model. In other words, it is a Gaussian distribution centered around the output of the previous timestep.
  1. The Gaussian parameters of the latent encoders vary over time in such a way that the distribution of the latent at the final timestep $T$ is a standard Gaussian.

Figure 3: A visual representation of a Variational Diffusion Model; $x_0$ represents true data observations such as natural images, $x_T$ represents pure Gaussian noise, and $x_t$ is an intermediate noisy version of $x_0$. Each $q(x_t|x_{t-1})$ is modeled as a Gaussian distribution that uses the output of the previous state as its mean.

Let us expand on the implications of these assumptions. From the first restriction, with some abuse of notation, we can now represent both true data samples and latent variables as $x_t$, where $t = 0$ represents true data samples and $t \in [1, T]$ represents a corresponding latent with hierarchy indexed by $t$. The VDM posterior is the same as the MHVAE posterior (Equation 24), but can now be rewritten as:

$$q(x_{1:T}|x_0) = \prod_{t=1}^{T} q(x_t|x_{t-1}) \tag{30}$$

From the second assumption, we know that the distribution of each latent variable in the encoder is a Gaussian centered around its previous hierarchical latent. Unlike a Markovian HVAE, the structure of the encoder at each timestep $t$ is not learned; it is fixed as a linear Gaussian model, where the mean and standard deviation can be set beforehand as hyperparameters [5], or learned as parameters [6]. We parameterize the Gaussian encoder with mean $\mu_t(x_t) = \sqrt{\alpha_t}\, x_{t-1}$, and variance $\Sigma_t(x_t) = (1 - \alpha_t)\mathbf{I}$, where the form of the coefficients is chosen such that the variance of the latent variables stays at a similar scale; in other words, the encoding process is variance-preserving. Note that alternate Gaussian parameterizations are allowed, and lead to similar derivations. The main takeaway is that $\alpha_t$ is a (potentially learnable) coefficient that can vary with the hierarchical depth $t$, for flexibility. Mathematically, encoder transitions are denoted as:

$$q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t}\, x_{t-1}, (1 - \alpha_t)\mathbf{I}) \tag{31}$$
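The variance-preserving property of this transition can be checked numerically: if $\mathrm{Var}(x_{t-1}) \approx 1$, then $\mathrm{Var}(\sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon) \approx \alpha_t + (1 - \alpha_t) = 1$. The sketch below uses an arbitrary illustrative value for $\alpha_t$:

```python
import random

random.seed(3)
alpha_t = 0.9   # illustrative value of the noise-schedule coefficient
n = 100_000

# One forward step: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps
x_prev = [random.gauss(0.0, 1.0) for _ in range(n)]   # Var(x_{t-1}) ~= 1
x_t = [(alpha_t ** 0.5) * x + ((1.0 - alpha_t) ** 0.5) * random.gauss(0.0, 1.0)
       for x in x_prev]

mean_t = sum(x_t) / n
var_t = sum(v * v for v in x_t) / n - mean_t ** 2

assert abs(var_t - 1.0) < 0.02   # latent variance stays at the same scale
```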

From the third assumption, we know that $\alpha_t$ evolves over time according to a fixed or learnable schedule structured such that the distribution of the final latent $p(x_T)$ is a standard Gaussian. We can then update the joint distribution of a Markovian HVAE (Equation 23) to write the joint distribution for a VDM as:

$$p(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1}|x_t) \tag{32}$$

where,

$$p(x_T) = \mathcal{N}(x_T; \mathbf{0}, \mathbf{I}) \tag{33}$$

Collectively, what this set of assumptions describes is a steady noisification of an image input over time; we progressively corrupt an image by adding Gaussian noise until eventually it becomes completely identical to pure Gaussian noise. Visually, this process is depicted in Figure 3.

Note that our encoder distributions $q(x_t|x_{t-1})$ are no longer parameterized by $\phi$, as they are completely modeled as Gaussians with defined mean and variance parameters at each timestep. Therefore, in a VDM, we are only interested in learning the conditionals $p_\theta(x_{t-1}|x_t)$, so that we can simulate new data. After optimizing the VDM, the sampling procedure is as simple as sampling Gaussian noise from $p(x_T)$ and iteratively running the denoising transitions $p_\theta(x_{t-1}|x_t)$ for $T$ steps to generate a novel $x_0$.
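This sampling procedure can be sketched as an ancestral sampling loop. Note that `denoise_mean` below is a hypothetical stand-in for the learned mean of $p_\theta(x_{t-1}|x_t)$ (in practice a neural network), and the fixed per-step noise scale is an illustrative simplification:

```python
import random

def denoise_mean(x_t, t):
    """Hypothetical stand-in for the learned mean of p_theta(x_{t-1} | x_t);
    a real VDM would use a trained neural network here."""
    return [0.99 * v for v in x_t]   # dummy: shrink slightly toward zero

def sample(dim=4, T=10, sigma=0.1, seed=4):
    rng = random.Random(seed)
    # Start from pure noise: x_T ~ p(x_T) = N(0, I)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(T, 0, -1):
        mean = denoise_mean(x, t)
        # No noise is added on the final step that produces x_0
        noise = [rng.gauss(0.0, 1.0) for _ in range(dim)] if t > 1 else [0.0] * dim
        x = [m + sigma * e for m, e in zip(mean, noise)]
    return x   # a generated x_0

x0 = sample()
assert len(x0) == 4
```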

Like any HVAE, the VDM can be optimized by maximizing the ELBO. Visually, the interpretation of this ELBO is depicted in Figure 4, and the cost of optimizing a VDM is primarily dominated by its third term, since we must optimize over all timesteps $t$. The ELBO can be derived as:


$$\log p(x) = \log \int p(x_{0:T}) \, dx_{1:T} \tag{34}$$

$$= \log \int \frac{p(x_{0:T}) \, q(x_{1:T}|x_0)}{q(x_{1:T}|x_0)} \, dx_{1:T} \tag{35}$$

$$= \log \mathbb{E}_{q(x_{1:T}|x_0)}\left[\frac{p(x_{0:T})}{q(x_{1:T}|x_0)}\right] \tag{36}$$

$$\geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_{0:T})}{q(x_{1:T}|x_0)}\right] \tag{37}$$

$$= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1}|x_t)}{\prod_{t=1}^{T} q(x_t|x_{t-1})}\right] \tag{38}$$

$$= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T) \, p_\theta(x_0|x_1) \prod_{t=2}^{T} p_\theta(x_{t-1}|x_t)}{q(x_T|x_{T-1}) \prod_{t=1}^{T-1} q(x_t|x_{t-1})}\right] \tag{39}$$

$$= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T) \, p_\theta(x_0|x_1) \prod_{t=1}^{T-1} p_\theta(x_t|x_{t+1})}{q(x_T|x_{T-1}) \prod_{t=1}^{T-1} q(x_t|x_{t-1})}\right] \tag{40}$$

$$= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T) \, p_\theta(x_0|x_1)}{q(x_T|x_{T-1})}\right] + \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \prod_{t=1}^{T-1} \frac{p_\theta(x_t|x_{t+1})}{q(x_t|x_{t-1})}\right] \tag{41}$$

$$= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log p_\theta(x_0|x_1)\right] + \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T)}{q(x_T|x_{T-1})}\right] + \mathbb{E}_{q(x_{1:T}|x_0)}\left[\sum_{t=1}^{T-1} \log \frac{p_\theta(x_t|x_{t+1})}{q(x_t|x_{t-1})}\right] \tag{42}$$

$$= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log p_\theta(x_0|x_1)\right] + \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T)}{q(x_T|x_{T-1})}\right] + \sum_{t=1}^{T-1} \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p_\theta(x_t|x_{t+1})}{q(x_t|x_{t-1})}\right] \tag{43}$$

$$= \mathbb{E}_{q(x_1|x_0)}\left[\log p_\theta(x_0|x_1)\right] + \mathbb{E}_{q(x_{T-1}, x_T|x_0)}\left[\log \frac{p(x_T)}{q(x_T|x_{T-1})}\right] + \sum_{t=1}^{T-1} \mathbb{E}_{q(x_{t-1}, x_t, x_{t+1}|x_0)}\left[\log \frac{p_\theta(x_t|x_{t+1})}{q(x_t|x_{t-1})}\right] \tag{44}$$

$$= \underbrace{\mathbb{E}_{q(x_1|x_0)}\left[\log p_\theta(x_0|x_1)\right]}_{\text{reconstruction term}} - \underbrace{\mathbb{E}_{q(x_{T-1}|x_0)}\left[D_{\mathrm{KL}}(q(x_T|x_{T-1}) \,\|\, p(x_T))\right]}_{\text{prior matching term}} - \sum_{t=1}^{T-1} \underbrace{\mathbb{E}_{q(x_{t-1}, x_{t+1}|x_0)}\left[D_{\mathrm{KL}}(q(x_t|x_{t-1}) \,\|\, p_\theta(x_t|x_{t+1}))\right]}_{\text{consistency term}} \tag{45}$$


The derived form of the ELBO can be interpreted in terms of its individual components:

  1. $\mathbb{E}_{q(x_1|x_0)}\left[\log p_\theta(x_0|x_1)\right]$ can be interpreted as a reconstruction term, predicting the log probability of the original data sample given the first-step latent. This term also appears in a vanilla VAE, and can be trained similarly.
  1. $\mathbb{E}_{q(x_{T-1}|x_0)}\left[D_{\mathrm{KL}}(q(x_T|x_{T-1}) \,\|\, p(x_T))\right]$ is a prior matching term; it is minimized when the final latent distribution matches the Gaussian prior. This term requires no optimization, as it has no trainable parameters; furthermore, as we have assumed a large enough $T$ such that the final distribution is Gaussian, this term effectively becomes zero.
  1. $\mathbb{E}_{q(x_{t-1}, x_{t+1}|x_0)}\left[D_{\mathrm{KL}}(q(x_t|x_{t-1}) \,\|\, p_\theta(x_t|x_{t+1}))\right]$ is a consistency term; it endeavors to make the distribution at $x_t$ consistent from both the forward and backward processes. That is, a denoising step from a noisier image should match the corresponding noising step from a cleaner image, for every intermediate timestep; this is reflected mathematically by the KL Divergence. This term is minimized when we train $p_\theta(x_t|x_{t+1})$ to match the Gaussian distribution $q(x_t|x_{t-1})$, which is defined in Equation 31.

Figure 4: Under our first derivation, a VDM can be optimized by ensuring that for every intermediate $x_t$, the posterior from the latent above it, $p_\theta(x_t|x_{t+1})$, matches the Gaussian corruption of the latent before it, $q(x_t|x_{t-1})$. In this figure, for each intermediate $x_t$, we minimize the difference between the distributions represented by the pink and green arrows.

Under this derivation, all terms of the ELBO are computed as expectations, and can therefore be approximated using Monte Carlo estimates. However, actually optimizing the ELBO using the terms we just derived might be suboptimal; because the consistency term is computed as an expectation over two random variables $\{x_{t-1}, x_{t+1}\}$ for every timestep, the variance of its Monte Carlo estimate could potentially be higher than that of a term estimated using only one random variable per timestep. As it is computed by summing up $T - 1$ consistency terms, the final estimated value of the ELBO may have high variance for large $T$ values.

Let us instead try to derive a form for our ELBO where each term is computed as an expectation over only one random variable at a time. The key insight is that we can rewrite encoder transitions as $q(x_t|x_{t-1}) = q(x_t|x_{t-1}, x_0)$, where the extra conditioning term is superfluous due to the Markov property. Then, according to Bayes rule, we can rewrite each transition as:

$$q(x_t|x_{t-1}, x_0) = \frac{q(x_{t-1}|x_t, x_0) \, q(x_t|x_0)}{q(x_{t-1}|x_0)} \tag{46}$$
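Equation 46 is an exact consequence of Bayes rule and the Markov property, which can be verified numerically on a tiny discrete Markov chain (the probability tables below are made-up illustrative values):

```python
# Tiny discrete Markov chain x0 -> x1 -> x2 over states {0, 1};
# all probability tables are made-up, purely for illustration.
p0 = [0.7, 0.3]                        # q(x0)
T1 = [[0.8, 0.2], [0.4, 0.6]]          # q(x1 | x0)
T2 = [[0.9, 0.1], [0.3, 0.7]]          # q(x2 | x1)

# Joint q(x0, x1, x2) under the Markov factorization
joint = {(a, b, c): p0[a] * T1[a][b] * T2[b][c]
         for a in (0, 1) for b in (0, 1) for c in (0, 1)}

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            # Left side: q(x2 | x1, x0), computed directly from the joint
            lhs = joint[(a, b, c)] / sum(joint[(a, b, cc)] for cc in (0, 1))
            # Right side pieces: q(x1 | x2, x0), q(x2 | x0), q(x1 | x0)
            q_x1_given_x2_x0 = joint[(a, b, c)] / sum(joint[(a, bb, c)]
                                                      for bb in (0, 1))
            q_x2_given_x0 = sum(joint[(a, bb, c)] for bb in (0, 1)) / p0[a]
            q_x1_given_x0 = sum(joint[(a, b, cc)] for cc in (0, 1)) / p0[a]
            rhs = q_x1_given_x2_x0 * q_x2_given_x0 / q_x1_given_x0
            assert abs(lhs - rhs) < 1e-12          # Bayes rule (Equation 46)
            assert abs(lhs - T2[b][c]) < 1e-12     # Markov: equals q(x2 | x1)
```

Here $x_1$ plays the role of $x_{t-1}$ and $x_2$ the role of $x_t$; the second assertion confirms that conditioning on $x_0$ is indeed superfluous.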

Armed with this new equation, we can retry the derivation resuming from the ELBO in Equation 37:

$$\log p(x) \geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_{0:T})}{q(x_{1:T}|x_0)}\right] \tag{47}$$

$$= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1}|x_t)}{\prod_{t=1}^{T} q(x_t|x_{t-1})}\right] \tag{48}$$

$$= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T) \, p_\theta(x_0|x_1) \prod_{t=2}^{T} p_\theta(x_{t-1}|x_t)}{q(x_1|x_0) \prod_{t=2}^{T} q(x_t|x_{t-1})}\right] \tag{49}$$

$$= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T) \, p_\theta(x_0|x_1) \prod_{t=2}^{T} p_\theta(x_{t-1}|x_t)}{q(x_1|x_0) \prod_{t=2}^{T} q(x_t|x_{t-1}, x_0)}\right] \tag{50}$$

$$= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T) \, p_\theta(x_0|x_1)}{q(x_1|x_0)} + \log \prod_{t=2}^{T} \frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1}, x_0)}\right] \tag{51}$$

$$= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T) \, p_\theta(x_0|x_1)}{q(x_1|x_0)} + \log \prod_{t=2}^{T} \frac{p_\theta(x_{t-1}|x_t)}{\frac{q(x_{t-1}|x_t, x_0) \, q(x_t|x_0)}{q(x_{t-1}|x_0)}}\right] \tag{52}$$

$$= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T) \, p_\theta(x_0|x_1)}{q(x_1|x_0)} + \log \prod_{t=2}^{T} \frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)} + \log \prod_{t=2}^{T} \frac{q(x_{t-1}|x_0)}{q(x_t|x_0)}\right] \tag{53}$$

$$= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T) \, p_\theta(x_0|x_1)}{q(x_1|x_0)} + \log \frac{q(x_1|x_0)}{q(x_T|x_0)} + \log \prod_{t=2}^{T} \frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)}\right] \tag{54}$$

$$= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T) \, p_\theta(x_0|x_1)}{q(x_T|x_0)} + \sum_{t=2}^{T} \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)}\right] \tag{55}$$

$$= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log p_\theta(x_0|x_1)\right] + \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p(x_T)}{q(x_T|x_0)}\right] + \sum_{t=2}^{T} \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)}\right] \tag{56}$$

$$= \mathbb{E}_{q(x_1|x_0)}\left[\log p_\theta(x_0|x_1)\right] + \mathbb{E}_{q(x_T|x_0)}\left[\log \frac{p(x_T)}{q(x_T|x_0)}\right] + \sum_{t=2}^{T} \mathbb{E}_{q(x_t, x_{t-1}|x_0)}\left[\log \frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)}\right] \tag{57}$$

$$= \underbrace{\mathbb{E}_{q(x_1|x_0)}\left[\log p_\theta(x_0|x_1)\right]}_{\text{reconstruction term}} - \underbrace{D_{\mathrm{KL}}(q(x_T|x_0) \,\|\, p(x_T))}_{\text{prior matching term}} - \sum_{t=2}^{T} \underbrace{\mathbb{E}_{q(x_t|x_0)}\left[D_{\mathrm{KL}}(q(x_{t-1}|x_t, x_0) \,\|\, p_\theta(x_{t-1}|x_t))\right]}_{\text{denoising matching term}} \tag{58}$$

We have therefore successfully derived an interpretation for the ELBO that can be estimated with lower variance, as each term is computed as an expectation of at most one random variable at a time. This formulation also has an elegant interpretation, which is revealed when inspecting each individual term:

  1. $\mathbb{E}_{q(x_1|x_0)}\left[\log p_\theta(x_0|x_1)\right]$ can be interpreted as a reconstruction term; like its analogue in the ELBO of a vanilla VAE, this term can be approximated and optimized using a Monte Carlo estimate.
  2. D_{\mathrm{KL}}(q(x_T \mid x_0) \parallel p(x_T)) represents how close the distribution of the final noisified input is to the standard Gaussian prior. It has no trainable parameters, and is also equal to zero under our assumptions.
  2. D_{\mathrm{KL}}(q(x_T \mid x_0) \parallel p(x_T)) 表示最终加噪输入的分布与标准高斯先验的接近程度。它没有可训练参数,并且在我们的假设下也等于零。
  3. \mathbb{E}_{q(x_t \mid x_0)}\left[D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t))\right] is a denoising matching term. We learn desired denoising transition step p_\theta(x_{t-1} \mid x_t) as an approximation to tractable, ground-truth denoising transition step q(x_{t-1} \mid x_t, x_0). The q(x_{t-1} \mid x_t, x_0) transition step can act as a ground-truth signal, since it defines how to denoise a noisy image x_t with access to what the final, completely denoised image x_0 should be. This term is therefore minimized when the two denoising steps match as closely as possible, as measured by their KL Divergence.
  3. \mathbb{E}_{q(x_t \mid x_0)}\left[D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t))\right] 是一个去噪匹配项。我们学习期望的去噪转移步骤 p_\theta(x_{t-1} \mid x_t),作为可处理的真实去噪转移步骤 q(x_{t-1} \mid x_t, x_0) 的近似。q(x_{t-1} \mid x_t, x_0) 转移步骤可以作为真实信号,因为它定义了如何利用最终完全去噪图像 x_0 来去噪带噪声的图像 x_t。因此,当两个去噪步骤通过它们的KL散度尽可能匹配时,该项达到最小。

As a side note, one observes that in the process of both ELBO derivations (Equation 45 and Equation 58), only the Markov assumption is used; as a result these formulae will hold true for any arbitrary Markovian HVAE. Furthermore,when we set T=1 ,both of the ELBO interpretations for a VDM exactly recreate the ELBO equation of a vanilla VAE, as written in Equation 19.

顺便提一下,可以观察到在两个ELBO推导过程中(方程45和方程58),仅使用了马尔可夫假设;因此这些公式对任意马尔可夫隐变量变分自编码器(HVAE)均成立。此外,当我们设定 T=1 时,VDM的两个ELBO解释完全重现了普通VAE的ELBO方程,如方程19所示。

In this derivation of the ELBO, the bulk of the optimization cost once again lies in the summation term, which dominates the reconstruction term. Whereas each KL Divergence term DKL(q(xt1xt,x0)pθ(xt1xt)) is difficult to minimize for arbitrary posteriors in arbitrarily complex Markovian HVAEs due to the added complexity of simultaneously learning the encoder, in a VDM we can leverage the Gaussian transition assumption to make optimization tractable. By Bayes rule, we have:

在该ELBO的推导中,优化成本的主要部分仍然集中在求和项上,该项主导了重构项。由于同时学习编码器带来的复杂性,对于任意后验和任意复杂的马尔可夫HVAE,每个KL散度项 DKL(q(xt1xt,x0)pθ(xt1xt)) 都难以最小化,而在VDM中,我们可以利用高斯转移假设使优化变得可行。根据贝叶斯定理,我们有:

q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}

As we already know that q(xtxt1,x0)=q(xtxt1)=N(xt;αtxt1,(1αt)I) from our assumption regarding encoder transitions (Equation 31),what remains is deriving for the forms of q(xtx0) and q(xt1x0) . Fortunately, these are also made tractable by utilizing the fact that the encoder transitions of a VDM are linear Gaussian models. Recall that under the reparameterization trick,samples xtq(xtxt1) can be rewritten as:

正如我们从关于编码器转移的假设(方程31)中已经知道的那样,剩下的就是推导q(xtx0)q(xt1x0)的形式。幸运的是,利用变分扩散模型(VDM)编码器转移是线性高斯模型的事实,这些推导也变得可行。回想在重参数化技巧下,样本xtq(xtxt1)可以重写为:

(59) x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon \quad \text{with } \epsilon \sim \mathcal{N}(\epsilon; 0, \mathbf{I})

and that similarly,samples xt1q(xt1xt2) can be rewritten as:

同样,样本xt1q(xt1xt2)也可以重写为:

(60) x_{t-1} = \sqrt{\alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_{t-1}}\, \epsilon \quad \text{with } \epsilon \sim \mathcal{N}(\epsilon; 0, \mathbf{I})

Figure 5: Depicted is an alternate, lower-variance method to optimize a VDM; we compute the form of ground-truth denoising step q(xt1xt,x0) using Bayes rule,and minimize its KL Divergence with our approximate denoising step pθ(xt1xt) . This is once again denoted visually by matching the distributions represented by the green arrows with those of the pink arrows. Artistic liberty is at play here; in the full picture,each pink arrow must also stem from x0 ,as it is also a conditioning term.

图5:展示了一种替代的、低方差的方法来优化VDM;我们使用贝叶斯定理计算真实去噪步骤q(xt1xt,x0)的形式,并最小化其与我们近似去噪步骤pθ(xt1xt)的KL散度。视觉上,这再次通过匹配绿色箭头所表示的分布与粉色箭头所表示的分布来表示。这里存在艺术上的自由;在完整图中,每个粉色箭头也必须源自x0,因为它也是一个条件项。

Then,the form of q(xtx0) can be recursively derived through repeated applications of the reparameterization trick. Suppose that we have access to 2T random noise variables {ϵt,ϵt}t=0T iid N(ϵ;0,I) . Then,for an arbitrary sample xtq(xtx0) ,we can rewrite it as:

然后,q(xtx0)的形式可以通过反复应用重参数化技巧递归推导。假设我们可以访问2T个随机噪声变量{ϵt,ϵt}t=0T iid N(ϵ;0,I)。那么,对于任意样本xtq(xtx0),我们可以将其重写为:

(61) x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1}^{*}

(62) = \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_{t-1}}\, \epsilon_{t-2}^{*}\right) + \sqrt{1 - \alpha_t}\, \epsilon_{t-1}^{*}

(63) = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t - \alpha_t \alpha_{t-1}}\, \epsilon_{t-2}^{*} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1}^{*}

(64) = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\left(\sqrt{\alpha_t - \alpha_t \alpha_{t-1}}\right)^2 + \left(\sqrt{1 - \alpha_t}\right)^2}\, \epsilon_{t-2}

(65) = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t - \alpha_t \alpha_{t-1} + 1 - \alpha_t}\, \epsilon_{t-2}

(66) = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \epsilon_{t-2}

(67) = \;\vdots

(68) = \sqrt{\prod_{i=1}^{t} \alpha_i}\, x_0 + \sqrt{1 - \prod_{i=1}^{t} \alpha_i}\, \epsilon_0

(69) = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon_0

(70) \sim \mathcal{N}\!\left(x_t; \sqrt{\bar\alpha_t}\, x_0, (1 - \bar\alpha_t)\mathbf{I}\right)

where in Equation 64 we have utilized the fact that the sum of two independent Gaussian random variables remains a Gaussian with mean being the sum of the two means, and variance being the sum of the two variances. Interpreting 1αtϵt1 as a sample from Gaussian N(0,(1αt)I) ,and αtαtαt1ϵt2 as a sample from Gaussian N(0,(αtαtαt1)I) ,we can then treat their sum as a random variable sampled from Gaussian N(0,(1αt+αtαtαt1)I)=N(0,(1αtαt1)I) . A sample from this distribution can then be represented using the reparameterization trick as 1αtαt1ϵt2 ,as in Equation 66.

在公式64中,我们利用了两个独立高斯随机变量之和仍为高斯分布的事实,其均值为两个均值之和,方差为两个方差之和。将1αtϵt1解释为来自高斯分布N(0,(1αt)I)的样本,αtαtαt1ϵt2解释为来自高斯分布N(0,(αtαtαt1)I)的样本,则它们的和可以视为从高斯分布N(0,(1αt+αtαtαt1)I)=N(0,(1αtαt1)I)中采样的随机变量。该分布的样本可以通过重参数化技巧表示为1αtαt1ϵt2,如公式66所示。
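As a quick numerical sanity check of this recursion, the sketch below (NumPy; the scalar schedule and sample count are illustrative assumptions, not taken from the text) iterates the per-step update of Equation 59 and compares the resulting empirical mean and variance against the closed form of Equation 70:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10
alphas = np.linspace(0.99, 0.9, T)           # hypothetical noise schedule
alpha_bar = np.cumprod(alphas)

x0 = 1.5                                      # a scalar "image" for illustration
n = 200_000

# Iterative forward process: x_t = sqrt(a_t) x_{t-1} + sqrt(1 - a_t) eps_t
x = np.full(n, x0)
for a in alphas:
    x = np.sqrt(a) * x + np.sqrt(1.0 - a) * rng.standard_normal(n)

# Closed form (Equation 70): q(x_T | x_0) = N(sqrt(abar_T) x0, 1 - abar_T)
mean_cf = np.sqrt(alpha_bar[-1]) * x0
var_cf = 1.0 - alpha_bar[-1]

print(abs(x.mean() - mean_cf) < 0.01, abs(x.var() - var_cf) < 0.01)
```

Both comparisons agree to within Monte Carlo error, matching the Gaussian-composition argument above.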

We have therefore derived the Gaussian form of q(xtx0) . This derivation can be modified to also yield the Gaussian parameterization describing q(xt1x0) . Now,knowing the forms of both q(xtx0) and q(xt1x0) we can proceed to calculate the form of q(xt1xt,x0) by substituting into the Bayes rule expansion:

因此,我们推导出了q(xtx0)的高斯形式。该推导也可以修改以得到描述q(xt1x0)的高斯参数化。现在,已知q(xtx0)q(xt1x0)的形式,我们可以通过代入贝叶斯规则展开式来计算q(xt1xt,x0)的形式:

(71) q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}

(72) = \frac{\mathcal{N}(x_t; \sqrt{\alpha_t}\, x_{t-1}, (1-\alpha_t)\mathbf{I})\; \mathcal{N}(x_{t-1}; \sqrt{\bar\alpha_{t-1}}\, x_0, (1-\bar\alpha_{t-1})\mathbf{I})}{\mathcal{N}(x_t; \sqrt{\bar\alpha_t}\, x_0, (1-\bar\alpha_t)\mathbf{I})}

(73) \propto \exp\left\{-\left[\frac{(x_t - \sqrt{\alpha_t} x_{t-1})^2}{2(1-\alpha_t)} + \frac{(x_{t-1} - \sqrt{\bar\alpha_{t-1}} x_0)^2}{2(1-\bar\alpha_{t-1})} - \frac{(x_t - \sqrt{\bar\alpha_t} x_0)^2}{2(1-\bar\alpha_t)}\right]\right\}

(74) = \exp\left\{-\frac{1}{2}\left[\frac{(x_t - \sqrt{\alpha_t} x_{t-1})^2}{1-\alpha_t} + \frac{(x_{t-1} - \sqrt{\bar\alpha_{t-1}} x_0)^2}{1-\bar\alpha_{t-1}} - \frac{(x_t - \sqrt{\bar\alpha_t} x_0)^2}{1-\bar\alpha_t}\right]\right\}

(75) = \exp\left\{-\frac{1}{2}\left[\frac{-2\sqrt{\alpha_t}\, x_t x_{t-1} + \alpha_t x_{t-1}^2}{1-\alpha_t} + \frac{x_{t-1}^2 - 2\sqrt{\bar\alpha_{t-1}}\, x_{t-1} x_0}{1-\bar\alpha_{t-1}} + C(x_t, x_0)\right]\right\}

(76) \propto \exp\left\{-\frac{1}{2}\left[-\frac{2\sqrt{\alpha_t}\, x_t x_{t-1}}{1-\alpha_t} + \frac{\alpha_t x_{t-1}^2}{1-\alpha_t} + \frac{x_{t-1}^2}{1-\bar\alpha_{t-1}} - \frac{2\sqrt{\bar\alpha_{t-1}}\, x_{t-1} x_0}{1-\bar\alpha_{t-1}}\right]\right\}

(77) = \exp\left\{-\frac{1}{2}\left[\left(\frac{\alpha_t}{1-\alpha_t} + \frac{1}{1-\bar\alpha_{t-1}}\right) x_{t-1}^2 - 2\left(\frac{\sqrt{\alpha_t}\, x_t}{1-\alpha_t} + \frac{\sqrt{\bar\alpha_{t-1}}\, x_0}{1-\bar\alpha_{t-1}}\right) x_{t-1}\right]\right\}

(78) = \exp\left\{-\frac{1}{2}\left[\frac{\alpha_t(1-\bar\alpha_{t-1}) + 1 - \alpha_t}{(1-\alpha_t)(1-\bar\alpha_{t-1})} x_{t-1}^2 - 2\left(\frac{\sqrt{\alpha_t}\, x_t}{1-\alpha_t} + \frac{\sqrt{\bar\alpha_{t-1}}\, x_0}{1-\bar\alpha_{t-1}}\right) x_{t-1}\right]\right\}

(79) = \exp\left\{-\frac{1}{2}\left[\frac{\alpha_t - \bar\alpha_t + 1 - \alpha_t}{(1-\alpha_t)(1-\bar\alpha_{t-1})} x_{t-1}^2 - 2\left(\frac{\sqrt{\alpha_t}\, x_t}{1-\alpha_t} + \frac{\sqrt{\bar\alpha_{t-1}}\, x_0}{1-\bar\alpha_{t-1}}\right) x_{t-1}\right]\right\}

(80) = \exp\left\{-\frac{1}{2}\left[\frac{1-\bar\alpha_t}{(1-\alpha_t)(1-\bar\alpha_{t-1})} x_{t-1}^2 - 2\left(\frac{\sqrt{\alpha_t}\, x_t}{1-\alpha_t} + \frac{\sqrt{\bar\alpha_{t-1}}\, x_0}{1-\bar\alpha_{t-1}}\right) x_{t-1}\right]\right\}

(81) = \exp\left\{-\frac{1}{2}\left(\frac{1-\bar\alpha_t}{(1-\alpha_t)(1-\bar\alpha_{t-1})}\right)\left[x_{t-1}^2 - 2\frac{\left(\frac{\sqrt{\alpha_t}\, x_t}{1-\alpha_t} + \frac{\sqrt{\bar\alpha_{t-1}}\, x_0}{1-\bar\alpha_{t-1}}\right)}{\frac{1-\bar\alpha_t}{(1-\alpha_t)(1-\bar\alpha_{t-1})}} x_{t-1}\right]\right\}

(82) = \exp\left\{-\frac{1}{2}\left(\frac{1-\bar\alpha_t}{(1-\alpha_t)(1-\bar\alpha_{t-1})}\right)\left[x_{t-1}^2 - 2\left(\frac{\sqrt{\alpha_t}\, x_t}{1-\alpha_t} + \frac{\sqrt{\bar\alpha_{t-1}}\, x_0}{1-\bar\alpha_{t-1}}\right)\frac{(1-\alpha_t)(1-\bar\alpha_{t-1})}{1-\bar\alpha_t} x_{t-1}\right]\right\}

(83) = \exp\left\{-\frac{1}{2}\left(\frac{1}{\frac{(1-\alpha_t)(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}}\right)\left[x_{t-1}^2 - 2\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\, x_t + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\, x_0}{1-\bar\alpha_t} x_{t-1}\right]\right\}

(84) \propto \mathcal{N}\Big(x_{t-1}; \underbrace{\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\, x_t + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\, x_0}{1-\bar\alpha_t}}_{\mu_q(x_t, x_0)}, \underbrace{\frac{(1-\alpha_t)(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\mathbf{I}}_{\Sigma_q(t)}\Big)

where in Equation 75, C(xt,x0) is a constant term with respect to xt1 computed as a combination of only xt,x0 ,and α values; this term is implicitly returned in Equation 84 to complete the square.

在公式75中,C(xt,x0)是相对于xt1的常数项,仅由xt,x0α的值组合计算得出;该项在公式84中隐式返回以完成平方。

We have therefore shown that at each step, x_{t-1} \sim q(x_{t-1} \mid x_t, x_0) is normally distributed, with mean \mu_q(x_t, x_0) that is a function of x_t and x_0, and variance \Sigma_q(t) as a function of \alpha coefficients. These \alpha coefficients are known and fixed at each timestep; they are either set permanently when modeled as hyperparameters, or treated as the current inference output of a network that seeks to model them. Following Equation 84, we can rewrite our variance equation as \Sigma_q(t) = \sigma_q^2(t)\mathbf{I}, where:

因此,我们已经证明在每一步中,x_{t-1} \sim q(x_{t-1} \mid x_t, x_0)服从正态分布,其均值\mu_q(x_t, x_0)是x_t和x_0的函数,方差\Sigma_q(t)是\alpha系数的函数。这些\alpha系数在每个时间步都是已知且固定的;它们要么作为超参数被永久设定,要么作为试图建模它们的网络当前推断的输出。根据公式84,我们可以将方差方程重写为\Sigma_q(t) = \sigma_q^2(t)\mathbf{I},其中:

(85) \sigma_q^2(t) = \frac{(1-\alpha_t)(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}

In order to match approximate denoising transition step p_\theta(x_{t-1} \mid x_t) to ground-truth denoising transition step q(x_{t-1} \mid x_t, x_0) as closely as possible, we can also model it as a Gaussian. Furthermore, as all \alpha terms are known to be frozen at each timestep, we can immediately construct the variance of the approximate denoising transition step to also be \Sigma_q(t) = \sigma_q^2(t)\mathbf{I}. We must parameterize its mean \mu_\theta(x_t, t) as a function of x_t, however, since p_\theta(x_{t-1} \mid x_t) does not condition on x_0. Recall that the KL Divergence between two Gaussian distributions is:

为了使近似去噪转移步骤p_\theta(x_{t-1} \mid x_t)尽可能接近真实去噪转移步骤q(x_{t-1} \mid x_t, x_0),我们也可以将其建模为高斯分布。此外,由于所有\alpha项在每个时间步都已知且固定,我们可以立即构造近似去噪转移步骤的方差也为\Sigma_q(t) = \sigma_q^2(t)\mathbf{I}。然而,我们必须将其均值\mu_\theta(x_t, t)参数化为x_t的函数,因为p_\theta(x_{t-1} \mid x_t)不依赖于x_0。回顾两个高斯分布之间的KL散度公式为:

(86) D_{\mathrm{KL}}(\mathcal{N}(x; \mu_x, \Sigma_x) \parallel \mathcal{N}(y; \mu_y, \Sigma_y)) = \frac{1}{2}\left[\log\frac{|\Sigma_y|}{|\Sigma_x|} - d + \mathrm{tr}(\Sigma_y^{-1}\Sigma_x) + (\mu_y - \mu_x)^T \Sigma_y^{-1} (\mu_y - \mu_x)\right]

In our case, where we can set the variances of the two Gaussians to match exactly, optimizing the KL Divergence term reduces to minimizing the difference between the means of the two distributions:

在我们的情况下,可以将两个高斯分布的方差设置为完全相同,优化KL散度项即简化为最小化两个分布均值之间的差异:

\underset{\theta}{\arg\min}\; D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t))

(87) = \underset{\theta}{\arg\min}\; D_{\mathrm{KL}}(\mathcal{N}(x_{t-1}; \mu_q, \Sigma_q(t)) \parallel \mathcal{N}(x_{t-1}; \mu_\theta, \Sigma_q(t)))

(88) = \underset{\theta}{\arg\min}\; \frac{1}{2}\left[\log\frac{|\Sigma_q(t)|}{|\Sigma_q(t)|} - d + \mathrm{tr}(\Sigma_q(t)^{-1}\Sigma_q(t)) + (\mu_\theta - \mu_q)^T \Sigma_q(t)^{-1} (\mu_\theta - \mu_q)\right]

(89) = \underset{\theta}{\arg\min}\; \frac{1}{2}\left[\log 1 - d + d + (\mu_\theta - \mu_q)^T \Sigma_q(t)^{-1} (\mu_\theta - \mu_q)\right]

(90) = \underset{\theta}{\arg\min}\; \frac{1}{2}\left[(\mu_\theta - \mu_q)^T \Sigma_q(t)^{-1} (\mu_\theta - \mu_q)\right]

(91) = \underset{\theta}{\arg\min}\; \frac{1}{2}\left[(\mu_\theta - \mu_q)^T (\sigma_q^2(t)\mathbf{I})^{-1} (\mu_\theta - \mu_q)\right]

(92) = \underset{\theta}{\arg\min}\; \frac{1}{2\sigma_q^2(t)}\left[\left\|\mu_\theta - \mu_q\right\|_2^2\right]

where we have written μq as shorthand for μq(xt,x0) ,and μθ as shorthand for μθ(xt,t) for brevity. In other words,we want to optimize a μθ(xt,t) that matches μq(xt,x0) ,which from our derived Equation 84, takes the form:

这里我们用μq作为μq(xt,x0)的简写,用μθ作为μθ(xt,t)的简写,简洁起见。换句话说,我们想要优化一个与μq(xt,x0)匹配的μθ(xt,t),根据我们推导的方程84,其形式为:

(93) \mu_q(x_t, x_0) = \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\, x_t + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\, x_0}{1-\bar\alpha_t}

As μθ(xt,t) also conditions on xt ,we can match μq(xt,x0) closely by setting it to the following form:

由于μθ(xt,t)也以xt为条件,我们可以通过将μq(xt,x0)设为以下形式来使其紧密匹配:

(94) \mu_\theta(x_t, t) = \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\, x_t + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\, \hat{x}_\theta(x_t, t)}{1-\bar\alpha_t}

where x^θ(xt,t) is parameterized by a neural network that seeks to predict x0 from noisy image xt and time index t . Then,the optimization problem simplifies to:

其中x^θ(xt,t)由一个神经网络参数化,该网络旨在从带噪声的图像xt和时间索引t预测x0。那么,优化问题简化为:

\underset{\theta}{\arg\min}\; D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t))

(95) = \underset{\theta}{\arg\min}\; D_{\mathrm{KL}}(\mathcal{N}(x_{t-1}; \mu_q, \Sigma_q(t)) \parallel \mathcal{N}(x_{t-1}; \mu_\theta, \Sigma_q(t)))

(96) = \underset{\theta}{\arg\min}\; \frac{1}{2\sigma_q^2(t)}\left[\left\|\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\, x_t + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\, \hat{x}_\theta(x_t, t)}{1-\bar\alpha_t} - \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\, x_t + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\, x_0}{1-\bar\alpha_t}\right\|_2^2\right]

(97) = \underset{\theta}{\arg\min}\; \frac{1}{2\sigma_q^2(t)}\left[\left\|\frac{\sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\, \hat{x}_\theta(x_t, t)}{1-\bar\alpha_t} - \frac{\sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\, x_0}{1-\bar\alpha_t}\right\|_2^2\right]

(98) = \underset{\theta}{\arg\min}\; \frac{1}{2\sigma_q^2(t)}\left[\left\|\frac{\sqrt{\bar\alpha_{t-1}}(1-\alpha_t)}{1-\bar\alpha_t}\left(\hat{x}_\theta(x_t, t) - x_0\right)\right\|_2^2\right]

(99) = \underset{\theta}{\arg\min}\; \frac{1}{2\sigma_q^2(t)} \frac{\bar\alpha_{t-1}(1-\alpha_t)^2}{(1-\bar\alpha_t)^2}\left[\left\|\hat{x}_\theta(x_t, t) - x_0\right\|_2^2\right]

Therefore, optimizing a VDM boils down to learning a neural network to predict the original ground truth image from an arbitrarily noisified version of it [5]. Furthermore, minimizing the summation term of our derived ELBO objective (Equation 58) across all noise levels can be approximated by minimizing the expectation over all timesteps:

因此,优化一个变分扩散模型(VDM)归结为学习一个神经网络,从任意加噪版本的图像中预测原始真实图像[5]。此外,通过最小化我们推导的ELBO目标(方程58)在所有噪声水平上的求和项,可以近似为最小化所有时间步的期望值:

(100) \underset{\theta}{\arg\min}\; \mathbb{E}_{t \sim U\{2, T\}}\left[\mathbb{E}_{q(x_t \mid x_0)}\left[D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t))\right]\right]

which can then be optimized using stochastic samples over timesteps.

然后可以通过对时间步的随机采样来优化该目标。
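The stochastic objective above can be sketched concretely. Below is a minimal NumPy illustration of one sampled-timestep loss evaluation, assuming a hypothetical linear noise schedule and a placeholder stand-in for the x0-predicting network (a real implementation would use a learned model such as a U-Net):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
alphas = 1.0 - np.linspace(1e-4, 0.02, T)    # hypothetical linear schedule
alpha_bar = np.cumprod(alphas)

def x_hat_theta(x_t, t):
    # Placeholder for a neural network predicting x0 from (x_t, t);
    # here we simply rescale x_t as an illustrative stand-in.
    return x_t / np.sqrt(alpha_bar[t])

def per_timestep_loss(x0, t):
    """Weighted x0-prediction loss of Equation 99 at one sampled timestep."""
    eps = rng.standard_normal(x0.shape)
    # Noisify directly via the closed form of Equation 69
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    # Variance of Equation 85 and the resulting weight of Equation 99
    sigma2 = (1 - alphas[t]) * (1 - alpha_bar[t - 1]) / (1 - alpha_bar[t])
    weight = alpha_bar[t - 1] * (1 - alphas[t]) ** 2 / (2 * sigma2 * (1 - alpha_bar[t]) ** 2)
    return weight * np.sum((x_hat_theta(x_t, t) - x0) ** 2)

x0 = rng.standard_normal(8)                  # toy data vector
t = int(rng.integers(2, T))                  # t ~ U{2, ..., T-1}, as in Equation 100
print(per_timestep_loss(x0, t) >= 0.0)
```

In practice, this loss would be averaged over a minibatch and backpropagated through the network that replaces the placeholder predictor.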

Learning Diffusion Noise Parameters

学习扩散噪声参数

Let us investigate how the noise parameters of a VDM can be jointly learned. One potential approach is to model αt using a neural network α^η(t) with parameters η . However,this is inefficient as inference must be performed multiple times at each timestep t to compute α¯t . Whereas caching can mitigate this computational cost, we can also derive an alternate way to learn the diffusion noise parameters. By substituting our variance equation from Equation 85 into our derived per-timestep objective in Equation 99, we can reduce:

让我们探讨如何联合学习VDM的噪声参数。一种可能的方法是使用参数为η的神经网络α^η(t)来建模αt。然而,这种方法效率低下,因为必须在每个时间步t多次进行推断以计算α¯t。虽然缓存可以缓解这一计算成本,我们还可以推导出另一种学习扩散噪声参数的方法。通过将方程85中的方差方程代入我们推导的每时间步目标方程99,可以简化为:

(101) \frac{1}{2\sigma_q^2(t)}\frac{\bar\alpha_{t-1}(1-\alpha_t)^2}{(1-\bar\alpha_t)^2}\left[\left\|\hat{x}_\theta(x_t, t) - x_0\right\|_2^2\right] = \frac{1}{2\frac{(1-\alpha_t)(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}}\frac{\bar\alpha_{t-1}(1-\alpha_t)^2}{(1-\bar\alpha_t)^2}\left[\left\|\hat{x}_\theta(x_t, t) - x_0\right\|_2^2\right]

(102) = \frac{1}{2}\frac{1-\bar\alpha_t}{(1-\alpha_t)(1-\bar\alpha_{t-1})}\frac{\bar\alpha_{t-1}(1-\alpha_t)^2}{(1-\bar\alpha_t)^2}\left[\left\|\hat{x}_\theta(x_t, t) - x_0\right\|_2^2\right]

(103) = \frac{1}{2}\frac{\bar\alpha_{t-1}(1-\alpha_t)}{(1-\bar\alpha_{t-1})(1-\bar\alpha_t)}\left[\left\|\hat{x}_\theta(x_t, t) - x_0\right\|_2^2\right]

(104) = \frac{1}{2}\frac{\bar\alpha_{t-1} - \bar\alpha_t}{(1-\bar\alpha_{t-1})(1-\bar\alpha_t)}\left[\left\|\hat{x}_\theta(x_t, t) - x_0\right\|_2^2\right]

(105) = \frac{1}{2}\frac{\bar\alpha_{t-1} - \bar\alpha_{t-1}\bar\alpha_t + \bar\alpha_{t-1}\bar\alpha_t - \bar\alpha_t}{(1-\bar\alpha_{t-1})(1-\bar\alpha_t)}\left[\left\|\hat{x}_\theta(x_t, t) - x_0\right\|_2^2\right]

(106) = \frac{1}{2}\frac{\bar\alpha_{t-1}(1-\bar\alpha_t) - \bar\alpha_t(1-\bar\alpha_{t-1})}{(1-\bar\alpha_{t-1})(1-\bar\alpha_t)}\left[\left\|\hat{x}_\theta(x_t, t) - x_0\right\|_2^2\right]

(107) = \frac{1}{2}\left(\frac{\bar\alpha_{t-1}(1-\bar\alpha_t)}{(1-\bar\alpha_{t-1})(1-\bar\alpha_t)} - \frac{\bar\alpha_t(1-\bar\alpha_{t-1})}{(1-\bar\alpha_{t-1})(1-\bar\alpha_t)}\right)\left[\left\|\hat{x}_\theta(x_t, t) - x_0\right\|_2^2\right]

(108) = \frac{1}{2}\left(\frac{\bar\alpha_{t-1}}{1-\bar\alpha_{t-1}} - \frac{\bar\alpha_t}{1-\bar\alpha_t}\right)\left[\left\|\hat{x}_\theta(x_t, t) - x_0\right\|_2^2\right]

Recall from Equation 70 that q(xtx0) is a Gaussian of form N(xt;α¯tx0,(1α¯t)I) . Then,following the definition of the signal-to-noise ratio (SNR) as SNR=μ2σ2 ,we can write the SNR at each timestep t as:

回顾方程70,q(xtx0)是形式为N(xt;α¯tx0,(1α¯t)I)的高斯分布。然后,按照信噪比(SNR)定义为SNR=μ2σ2,我们可以将每个时间步t的SNR写为:

(109) \mathrm{SNR}(t) = \frac{\bar\alpha_t}{1-\bar\alpha_t}

Then, our derived Equation 108 (and Equation 99) can be simplified as:

接着,我们推导的方程108(及方程99)可以简化为:

(110) \frac{1}{2\sigma_q^2(t)}\frac{\bar\alpha_{t-1}(1-\alpha_t)^2}{(1-\bar\alpha_t)^2}\left[\left\|\hat{x}_\theta(x_t, t) - x_0\right\|_2^2\right] = \frac{1}{2}\left(\mathrm{SNR}(t-1) - \mathrm{SNR}(t)\right)\left[\left\|\hat{x}_\theta(x_t, t) - x_0\right\|_2^2\right]

As the name implies, the SNR represents the ratio between the original signal and the amount of noise present; a higher SNR represents more signal and a lower SNR represents more noise. In a diffusion model, we require the SNR to monotonically decrease as timestep t increases; this formalizes the notion that perturbed input xt becomes increasingly noisy over time,until it becomes identical to a standard Gaussian at t=T .

顾名思义,信噪比表示原始信号与噪声量的比率;较高的SNR表示信号较强,较低的SNR表示噪声较多。在扩散模型中,我们要求SNR随着时间步t的增加单调递减;这形式化了扰动输入xt随时间变得越来越嘈杂的概念,直到在t=T时变得与标准高斯分布相同。
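This monotonicity requirement is easy to check numerically: since each \alpha_t lies in (0, 1), \bar\alpha_t strictly decreases in t, and so does \bar\alpha_t / (1 - \bar\alpha_t). A small sketch, assuming a hypothetical linear schedule:

```python
import numpy as np

T = 1000
alphas = 1.0 - np.linspace(1e-4, 0.02, T)    # hypothetical linear schedule
alpha_bar = np.cumprod(alphas)
snr = alpha_bar / (1.0 - alpha_bar)          # Equation 109

# SNR decreases monotonically in t, from high signal to nearly pure noise
print(bool(np.all(np.diff(snr) < 0)))
```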

Following the simplification of the objective in Equation 110, we can directly parameterize the SNR at each timestep using a neural network, and learn it jointly along with the diffusion model. As the SNR must monotonically decrease over time, we can represent it as:

根据方程110中目标的简化,我们可以直接用神经网络参数化每个时间步的SNR,并与扩散模型一起联合学习。由于SNR必须随时间单调递减,我们可以将其表示为:

(111) \mathrm{SNR}(t) = \exp(-\omega_\eta(t))

where ωη(t) is modeled as a monotonically increasing neural network with parameters η . Negating ωη(t) results in a monotonically decreasing function, whereas the exponential forces the resulting term to be positive. Note that the objective in Equation 100 must now optimize over η as well. By combining our parameterization of SNR in Equation 111 with our definition of SNR in Equation 109, we can also explicitly derive elegant forms for the value of α¯t as well as for the value of 1α¯t :

其中 ωη(t) 被建模为一个参数为 η 的单调递增神经网络。对 ωη(t) 取反得到一个单调递减函数,而指数函数则保证结果项为正。注意,方程100中的目标现在也必须对 η 进行优化。通过将方程111中对信噪比(SNR)的参数化与方程109中对信噪比的定义结合,我们还可以显式推导出 α¯t1α¯t 的优雅表达式:

(112) \frac{\bar\alpha_t}{1-\bar\alpha_t} = \exp(-\omega_\eta(t))

(113) \bar\alpha_t = \mathrm{sigmoid}(-\omega_\eta(t))

(114) 1 - \bar\alpha_t = \mathrm{sigmoid}(\omega_\eta(t))

These terms are necessary for a variety of computations; for example, during optimization, they are used to create arbitrarily noisy xt from input x0 using the reparameterization trick,as derived in Equation 69.

这些项对于各种计算是必要的;例如,在优化过程中,它们被用来通过重参数化技巧从输入 x0 创建任意噪声的 xt,如方程69所推导。
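A minimal sketch of this parameterization follows. The positive-weight construction used to enforce monotonicity of \omega_\eta(t) is one common modeling choice assumed here, not something prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(2)

# A monotonically increasing "network" omega_eta(t): non-negative weights
# plus ReLU keep each layer order-preserving in the scalar input t.
W1 = np.abs(rng.standard_normal((1, 16)))
W2 = np.abs(rng.standard_normal((16, 1)))

def omega(t):
    h = np.maximum(t.reshape(-1, 1) @ W1, 0.0)   # ReLU preserves monotonicity
    return (h @ W2).ravel()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

t = np.linspace(0.0, 1.0, 100)
snr = np.exp(-omega(t))                      # Equation 111
alpha_bar = sigmoid(-omega(t))               # Equation 113

print(bool(np.all(np.diff(snr) <= 0)), bool(np.all(np.diff(alpha_bar) <= 0)))
```

Because \omega_\eta(t) is nondecreasing, both the SNR and \bar\alpha_t decrease with t, and \bar\alpha_t stays in (0, 1) as required.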

Three Equivalent Interpretations

三种等价解释

As we previously proved, a Variational Diffusion Model can be trained by simply learning a neural network to predict the original natural image x0 from an arbitrary noised version xt and its time index t . However, x0 has two other equivalent parameterizations,which leads to two further interpretations for a VDM.

正如我们之前证明的,变分扩散模型(Variational Diffusion Model,VDM)可以通过学习一个神经网络来预测任意噪声版本 xt 及其时间索引 t 下的原始自然图像 x0 来训练。然而,x0 还有另外两种等价的参数化方式,这导致了对VDM的另外两种解释。

Firstly,we can utilize the reparameterization trick. In our derivation of the form of q(xtx0) ,we can rearrange Equation 69 to show that:

首先,我们可以利用重参数化技巧。在对 q(xtx0) 形式的推导中,我们可以重新排列方程69,得到:

(115) x_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_0}{\sqrt{\bar\alpha_t}}

Plugging this into our previously derived true denoising transition mean μq(xt,x0) ,we can rederive as:

将其代入我们之前推导的真实去噪转移均值 μq(xt,x0),我们可以重新推导为:

(116) \mu_q(x_t, x_0) = \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\, x_t + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\, x_0}{1-\bar\alpha_t}

(117) = \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\, x_t + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\frac{x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_0}{\sqrt{\bar\alpha_t}}}{1-\bar\alpha_t}

(118) = \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\, x_t + (1-\alpha_t)\frac{x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_0}{\sqrt{\alpha_t}}}{1-\bar\alpha_t}

(119) = \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\, x_t}{1-\bar\alpha_t} + \frac{(1-\alpha_t)\, x_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}} - \frac{(1-\alpha_t)\sqrt{1-\bar\alpha_t}\, \epsilon_0}{(1-\bar\alpha_t)\sqrt{\alpha_t}}

(120) = \left(\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t} + \frac{1-\alpha_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}}\right) x_t - \frac{(1-\alpha_t)\sqrt{1-\bar\alpha_t}}{(1-\bar\alpha_t)\sqrt{\alpha_t}}\epsilon_0

(121) = \left(\frac{\alpha_t(1-\bar\alpha_{t-1})}{(1-\bar\alpha_t)\sqrt{\alpha_t}} + \frac{1-\alpha_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}}\right) x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\sqrt{\alpha_t}}\epsilon_0

(122) = \frac{\alpha_t - \bar\alpha_t + 1 - \alpha_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}} x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\sqrt{\alpha_t}}\epsilon_0

(123) = \frac{1-\bar\alpha_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}} x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\sqrt{\alpha_t}}\epsilon_0

(124) = \frac{1}{\sqrt{\alpha_t}} x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\sqrt{\alpha_t}}\epsilon_0

Therefore,we can set our approximate denoising transition mean μθ(xt,t) as:

因此,我们可以将近似去噪转移均值 μθ(xt,t) 设定为:

(125) \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\sqrt{\alpha_t}}\hat{\epsilon}_\theta(x_t, t)

and the corresponding optimization problem becomes:

相应的优化问题变为:

\underset{\theta}{\arg\min}\; D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t))

(126) = \underset{\theta}{\arg\min}\; D_{\mathrm{KL}}(\mathcal{N}(x_{t-1}; \mu_q, \Sigma_q(t)) \parallel \mathcal{N}(x_{t-1}; \mu_\theta, \Sigma_q(t)))

(127) = \underset{\theta}{\arg\min}\; \frac{1}{2\sigma_q^2(t)}\left[\left\|\frac{1}{\sqrt{\alpha_t}} x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\sqrt{\alpha_t}}\hat{\epsilon}_\theta(x_t, t) - \frac{1}{\sqrt{\alpha_t}} x_t + \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\sqrt{\alpha_t}}\epsilon_0\right\|_2^2\right]

(128) = \underset{\theta}{\arg\min}\; \frac{1}{2\sigma_q^2(t)}\left[\left\|\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\sqrt{\alpha_t}}\epsilon_0 - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\sqrt{\alpha_t}}\hat{\epsilon}_\theta(x_t, t)\right\|_2^2\right]

(129) = \underset{\theta}{\arg\min}\; \frac{1}{2\sigma_q^2(t)}\left[\left\|\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\sqrt{\alpha_t}}\left(\epsilon_0 - \hat{\epsilon}_\theta(x_t, t)\right)\right\|_2^2\right]

(130) = \underset{\theta}{\arg\min}\; \frac{1}{2\sigma_q^2(t)}\frac{(1-\alpha_t)^2}{(1-\bar\alpha_t)\alpha_t}\left[\left\|\epsilon_0 - \hat{\epsilon}_\theta(x_t, t)\right\|_2^2\right]

Here, ϵ^θ(xt,t) is a neural network that learns to predict the source noise ϵ0N(ϵ;0,I) that determines xt from x0 . We have therefore shown that learning a VDM by predicting the original image x0 is equivalent to learning to predict the noise; empirically, however, some works have found that predicting the noise resulted in better performance [5,7] .

这里,ϵ^θ(xt,t) 是一个神经网络,学习从 x0 预测决定 xt 的源噪声 ϵ0N(ϵ;0,I)。因此,我们已经证明,通过预测原始图像 x0 来学习VDM等价于学习预测噪声;然而,经验上,一些研究发现预测噪声能带来更好的性能 [5,7]
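Because the derivation above is exact algebra, the x0-parameterization (Equation 94) and the noise parameterization (Equation 125) give identical denoising means whenever the two predictions are related by Equation 115. A quick numerical check, assuming a hypothetical schedule and an arbitrary noise prediction:

```python
import numpy as np

rng = np.random.default_rng(6)
T = 100
alphas = 1.0 - np.linspace(1e-3, 0.05, T)   # hypothetical schedule
alpha_bar = np.cumprod(alphas)

t = 50
x_t = rng.standard_normal(4)
eps_hat = rng.standard_normal(4)            # an arbitrary noise prediction

# x0-prediction implied by the noise prediction (Equation 115)
x0_hat = (x_t - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])

# Mean via the x0 parameterization (Equation 94)
mu_x0 = (np.sqrt(alphas[t]) * (1 - alpha_bar[t - 1]) * x_t
         + np.sqrt(alpha_bar[t - 1]) * (1 - alphas[t]) * x0_hat) / (1 - alpha_bar[t])

# Mean via the noise parameterization (Equation 125)
mu_eps = (x_t / np.sqrt(alphas[t])
          - (1 - alphas[t]) * eps_hat / np.sqrt(alphas[t] * (1 - alpha_bar[t])))

print(bool(np.allclose(mu_x0, mu_eps)))     # → True
```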

To derive the third common interpretation of Variational Diffusion Models, we appeal to Tweedie's Formula [8]. In English, Tweedie's Formula states that the true mean of an exponential family distribution, given samples drawn from it, can be estimated by the maximum likelihood estimate of the samples (aka empirical mean) plus some correction term involving the score of the estimate. In the case of just one observed sample, the empirical mean is just the sample itself. It is commonly used to mitigate sample bias; if observed samples all lie on one end of the underlying distribution, then the negative score becomes large and corrects the naive maximum likelihood estimate of the samples towards the true mean.

为了推导变分扩散模型的第三种常见解释,我们借助了Tweedie公式[8]。通俗地说,Tweedie公式指出,给定从指数族分布中抽取的样本,其真实均值可以通过样本的最大似然估计(即经验均值)加上一个涉及估计得分函数的修正项来估计。在只有一个观测样本的情况下,经验均值即为该样本本身。该公式常用于缓解样本偏差;如果观测样本都位于潜在分布的一端,则负得分会变大,从而修正样本的朴素最大似然估计,使其更接近真实均值。

Mathematically,for a Gaussian variable zN(z;μz,z) ,Tweedie’s Formula states that:

数学上,对于高斯变量 zN(z;μz,z),Tweedie公式表述为:

\mathbb{E}[\mu_z \mid z] = z + \Sigma_z \nabla_z \log p(z)

In this case,we apply it to predict the true posterior mean of xt given its samples. From Equation 70,we know that:

在此情形下,我们应用该公式来预测给定样本的 xt 的真实后验均值。由方程70可知:

q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar\alpha_t}\, x_0, (1-\bar\alpha_t)\mathbf{I})

Then, by Tweedie's Formula, we have:

然后,根据Tweedie公式,我们有:

(131) \mathbb{E}[\mu_{x_t} \mid x_t] = x_t + (1-\bar\alpha_t)\nabla_{x_t}\log p(x_t)

where we write xtlogp(xt) as logp(xt) for notational simplicity. According to Tweedie’s Formula,the best estimate for the true mean that xt is generated from, μxt=α¯tx0 ,is defined as:

我们将 xtlogp(xt) 写作 logp(xt) 以简化符号。根据 Tweedie 公式,生成 xt 的真实均值的最佳估计 μxt=α¯tx0 定义为:

(132) \sqrt{\bar\alpha_t}\, x_0 = x_t + (1-\bar\alpha_t)\nabla\log p(x_t)

(133) x_0 = \frac{x_t + (1-\bar\alpha_t)\nabla\log p(x_t)}{\sqrt{\bar\alpha_t}}

Then,we can plug Equation 133 into our ground-truth denoising transition mean μq(xt,x0) once again and derive a new form:

然后,我们可以将方程133再次代入我们的真实去噪转移均值 μq(xt,x0),并推导出一种新形式:

(134) \mu_q(x_t, x_0) = \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\, x_t + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\, x_0}{1-\bar\alpha_t}

(135) = \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\, x_t + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\frac{x_t + (1-\bar\alpha_t)\nabla\log p(x_t)}{\sqrt{\bar\alpha_t}}}{1-\bar\alpha_t}

(136) = \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\, x_t + (1-\alpha_t)\frac{x_t + (1-\bar\alpha_t)\nabla\log p(x_t)}{\sqrt{\alpha_t}}}{1-\bar\alpha_t}

(137) = \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\, x_t}{1-\bar\alpha_t} + \frac{(1-\alpha_t)\, x_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}} + \frac{(1-\alpha_t)(1-\bar\alpha_t)\nabla\log p(x_t)}{(1-\bar\alpha_t)\sqrt{\alpha_t}}

(138) = \left(\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t} + \frac{1-\alpha_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}}\right) x_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}}\nabla\log p(x_t)

(139) = \left(\frac{\alpha_t(1-\bar\alpha_{t-1})}{(1-\bar\alpha_t)\sqrt{\alpha_t}} + \frac{1-\alpha_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}}\right) x_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}}\nabla\log p(x_t)

(140) = \frac{\alpha_t - \bar\alpha_t + 1 - \alpha_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}} x_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}}\nabla\log p(x_t)

(141) = \frac{1-\bar\alpha_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}} x_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}}\nabla\log p(x_t)

(142) = \frac{1}{\sqrt{\alpha_t}} x_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}}\nabla\log p(x_t)

Therefore,we can also set our approximate denoising transition mean μθ(xt,t) as:

因此,我们也可以将近似去噪转移均值 μθ(xt,t) 设定为:

(143) \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} x_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}} s_\theta(x_t, t)

and the corresponding optimization problem becomes:

相应的优化问题变为:

\underset{\theta}{\arg\min}\; D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t))

(144) = \underset{\theta}{\arg\min}\; D_{\mathrm{KL}}(\mathcal{N}(x_{t-1}; \mu_q, \Sigma_q(t)) \parallel \mathcal{N}(x_{t-1}; \mu_\theta, \Sigma_q(t)))

(145) = \underset{\theta}{\arg\min}\; \frac{1}{2\sigma_q^2(t)}\left[\left\|\frac{1}{\sqrt{\alpha_t}} x_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}} s_\theta(x_t, t) - \frac{1}{\sqrt{\alpha_t}} x_t - \frac{1-\alpha_t}{\sqrt{\alpha_t}}\nabla\log p(x_t)\right\|_2^2\right]

(146) = \underset{\theta}{\arg\min}\; \frac{1}{2\sigma_q^2(t)}\left[\left\|\frac{1-\alpha_t}{\sqrt{\alpha_t}} s_\theta(x_t, t) - \frac{1-\alpha_t}{\sqrt{\alpha_t}}\nabla\log p(x_t)\right\|_2^2\right]

(147) = \underset{\theta}{\arg\min}\; \frac{1}{2\sigma_q^2(t)}\left[\left\|\frac{1-\alpha_t}{\sqrt{\alpha_t}}\left(s_\theta(x_t, t) - \nabla\log p(x_t)\right)\right\|_2^2\right]

(148) = \underset{\theta}{\arg\min}\; \frac{1}{2\sigma_q^2(t)}\frac{(1-\alpha_t)^2}{\alpha_t}\left[\left\|s_\theta(x_t, t) - \nabla\log p(x_t)\right\|_2^2\right]

Here, sθ(xt,t) is a neural network that learns to predict the score function xtlogp(xt) ,which is the gradient of xt in data space,for any arbitrary noise level t .

这里,sθ(xt,t) 是一个神经网络,学习预测得分函数 xtlogp(xt),即数据空间中 xt 的梯度,适用于任意噪声水平 t

The astute reader will notice that the score function logp(xt) looks remarkably similar in form to the source noise ϵ0 . This can be shown explicitly by combining Tweedie’s Formula (Equation 133) with the reparameterization trick (Equation 115):

细心的读者会注意到得分函数 logp(xt) 在形式上与源噪声 ϵ0 非常相似。通过结合 Tweedie 公式(方程133)和重参数化技巧(方程115),可以明确展示这一点:

(149) x_0 = \frac{x_t + (1-\bar\alpha_t)\nabla\log p(x_t)}{\sqrt{\bar\alpha_t}} = \frac{x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_0}{\sqrt{\bar\alpha_t}}

(150) \therefore (1-\bar\alpha_t)\nabla\log p(x_t) = -\sqrt{1-\bar\alpha_t}\, \epsilon_0

(151) \nabla\log p(x_t) = -\frac{1}{\sqrt{1-\bar\alpha_t}}\epsilon_0

As it turns out, the two terms are off by a constant factor that scales with time! The score function measures how to move in data space to maximize the log probability; intuitively, since the source noise is added to a natural image to corrupt it, moving in its opposite direction "denoises" the image and would be the best update to increase the subsequent log probability. Our mathematical proof justifies this intuition; we have explicitly shown that learning to model the score function is equivalent to modeling the negative of the source noise (up to a scaling factor).

事实证明,这两项相差一个随时间缩放的常数因子!得分函数衡量如何在数据空间中移动以最大化对数概率;直观上,由于源噪声被添加到自然图像中以使其受损,沿相反方向移动即“去噪”图像,是增加后续对数概率的最佳更新。我们的数学证明验证了这一直觉;我们明确展示了学习建模得分函数等价于建模源噪声的负值(仅差一个缩放因子)。
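This relationship can be checked numerically in the simple case of a single fixed x0, where q(x_t | x0) is Gaussian and its score is available in closed form (the schedule value below is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha_bar_t = 0.4                           # hypothetical value at some timestep t
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps   # Equation 69

# For a single fixed x0, q(x_t | x0) is Gaussian, so its score is exactly
# grad log N(x_t; sqrt(abar) x0, (1 - abar) I).
score = -(x_t - np.sqrt(alpha_bar_t) * x0) / (1 - alpha_bar_t)

# Equation 151: the score equals the scaled negative source noise.
print(bool(np.allclose(score, -eps / np.sqrt(1 - alpha_bar_t))))   # → True
```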

We have therefore derived three equivalent objectives to optimize a VDM: learning a neural network to predict the original image x0 ,the source noise ϵ0 ,or the score of the image at an arbitrary noise level logp(xt) . The VDM can be scalably trained by stochastically sampling timesteps t and minimizing the norm of the prediction with the ground truth target.

因此,我们推导出了优化变分扩散模型(VDM)的三种等价目标:学习神经网络预测原始图像 x0、源噪声 ϵ0 或任意噪声水平下图像的得分 logp(xt)。VDM 可通过随机采样时间步 t 并最小化预测与真实目标的范数来可扩展训练。

Score-based Generative Models

基于得分的生成模型

We have shown that a Variational Diffusion Model can be learned simply by optimizing a neural network sθ(xt,t) to predict the score function logp(xt) . However,in our derivation,the score term arrived from an application of Tweedie's Formula; this doesn't necessarily provide us with great intuition or insight into what exactly the score function is or why it is worth modeling. Fortunately, we can look to another class of generative models, Score-based Generative Models [9, 10, 11], for exactly this intuition. As it turns out, we can show that the VDM formulation we have previously derived has an equivalent Score-based Generative Modeling formulation, allowing us to flexibly switch between these two interpretations at will.

我们已经展示了变分扩散模型可以通过优化神经网络 sθ(xt,t) 来预测得分函数 logp(xt) 简单学习。然而,在我们的推导中,得分项源自 Tweedie 公式的应用;这并不一定能让我们深入理解得分函数的本质或为何值得建模。幸运的是,我们可以参考另一类生成模型——基于得分的生成模型[9, 10, 11],以获得这种直觉。事实证明,我们之前推导的 VDM 形式与基于得分的生成模型形式等价,使我们能够灵活地在这两种解释间切换。

Figure 6: Visualization of three random sampling trajectories generated with Langevin dynamics, all starting from the same initialization point, for a Mixture of Gaussians. The left figure plots these sampling trajectories on a three-dimensional contour, while the right figure plots the sampling trajectories against the ground-truth score function. From the same initialization point, we are able to generate samples from different modes due to the stochastic noise term in the Langevin dynamics sampling procedure; without it, sampling from a fixed point would always deterministically follow the score to the same mode every trial.

图6:用 Langevin 动力学生成的三个随机采样轨迹的可视化,均从相同初始化点开始,针对高斯混合模型。左图在三维等高线上绘制采样轨迹,右图则将采样轨迹与真实得分函数对比。从同一初始化点出发,由于 Langevin 动力学采样过程中的随机噪声项,我们能够从不同模态生成样本;若无该噪声,固定点采样每次都会确定性地沿得分函数路径到达同一模态。

To begin to understand why optimizing a score function makes sense, we take a detour and revisit energy-based models [12, 13]. Arbitrarily flexible probability distributions can be written in the form:

为了开始理解为何优化得分函数合理,我们绕道回顾能量基模型[12, 13]。任意灵活的概率分布可以写成如下形式:

(152) p_\theta(x) = \frac{1}{Z_\theta} e^{-f_\theta(x)}

where fθ(x) is an arbitrarily flexible,parameterizable function called the energy function,often modeled by a neural network,and Zθ is a normalizing constant to ensure that pθ(x)dx=1 . One way to learn such a distribution is maximum likelihood; however, this requires tractably computing the normalizing constant Zθ=efθ(x)dx ,which may not be possible for complex fθ(x) functions.

其中 fθ(x) 是一个任意灵活且可参数化的函数,称为能量函数(energy function),通常由神经网络建模,Zθ 是一个归一化常数,用于确保 pθ(x)dx=1 。学习这种分布的一种方法是最大似然估计;然而,这需要可行地计算归一化常数 Zθ=efθ(x)dx,而对于复杂的 fθ(x) 函数,这可能不可行。

One way to avoid calculating or modeling the normalization constant is by using a neural network sθ(x) to learn the score function logp(x) of distribution p(x) instead. This is motivated by the observation that taking the derivative of the log of both sides of Equation 152 yields:

避免计算或建模归一化常数的一种方法是使用神经网络 sθ(x) 来学习分布 p(x) 的得分函数(score function)logp(x)。这一方法的动机来自于观察到对等式152两边取对数的导数得到:

(153) \nabla_x \log p_\theta(x) = \nabla_x \log\left(\frac{1}{Z_\theta} e^{-f_\theta(x)}\right)

(154) = \nabla_x \log\frac{1}{Z_\theta} + \nabla_x \log e^{-f_\theta(x)}

(155) = -\nabla_x f_\theta(x)

(156) \approx s_\theta(x)

which can be freely represented as a neural network without involving any normalization constants. The score model can be optimized by minimizing the Fisher Divergence with the ground truth score function:

该函数可以自由地用神经网络表示,而不涉及任何归一化常数。得分模型可以通过最小化与真实得分函数的费舍尔散度(Fisher Divergence)来优化:

(157) \mathbb{E}_{p(x)}\left[\left\|s_\theta(x) - \nabla\log p(x)\right\|_2^2\right]

What does the score function represent? For every x ,taking the gradient of its log likelihood with respect to x essentially describes what direction in data space to move in order to further increase its likelihood. Intuitively,then,the score function defines a vector field over the entire space that data x inhabits,pointing towards the modes. Visually, this is depicted in the right plot of Figure 6. Then, by learning the score function of the true data distribution, we can generate samples by starting at any arbitrary point in the same space and iteratively following the score until a mode is reached. This sampling procedure is known as Langevin dynamics, and is mathematically described as:

得分函数代表什么?对于每个 x,对其对数似然关于 x 的梯度本质上描述了在数据空间中朝哪个方向移动以进一步增加其似然值。直观地说,得分函数定义了数据 x 所处整个空间上的一个向量场,指向模式(modes)。在图6的右图中对此有直观展示。通过学习真实数据分布的得分函数,我们可以从空间中的任意点开始,迭代地沿着得分方向移动直到达到一个模式,从而生成样本。该采样过程称为朗之万动力学(Langevin dynamics),其数学描述为:

(158) x_{i+1} \leftarrow x_i + c\,\nabla\log p(x_i) + \sqrt{2c}\,\epsilon, \quad i = 0, 1, \ldots, K

where x0 is randomly sampled from a prior distribution (such as uniform),and ϵN(ϵ;0,I) is an extra noise term to ensure that the generated samples do not always collapse onto a mode, but hover around it for diversity. Furthermore, because the learned score function is deterministic, sampling with a noise term involved adds stochasticity to the generative process, allowing us to avoid deterministic trajectories. This is particularly useful when sampling is initialized from a position that lies between multiple modes. A visual depiction of Langevin dynamics sampling and the benefits of the noise term is shown in Figure 6.

其中 x0 是从先验分布(如均匀分布)随机采样的,ϵN(ϵ;0,I) 是一个额外的噪声项,用以确保生成的样本不会总是坍缩到某个模式,而是在其周围徘徊以保持多样性。此外,由于学习到的得分函数是确定性的,采样过程中加入噪声项为生成过程引入了随机性,避免了确定性轨迹。这在采样初始化位置位于多个模式之间时尤为有用。图6展示了朗之万动力学采样及噪声项带来的益处的可视化效果。
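A minimal sketch of Langevin dynamics sampling (Equation 158) on a hypothetical two-mode 1-D Gaussian mixture, with the ground-truth score computed in closed form; the mixture, step size, and chain length are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)

# Two-mode 1-D Gaussian mixture p(x) = 0.5 N(-2, 0.5^2) + 0.5 N(2, 0.5^2)
mus, sig = np.array([-2.0, 2.0]), 0.5

def score(x):
    # grad_x log p(x) via the responsibilities of the two components
    w = np.exp(-0.5 * ((x[:, None] - mus) / sig) ** 2)
    w = w / w.sum(axis=1, keepdims=True)
    return (w * (mus - x[:, None])).sum(axis=1) / sig**2

# Langevin dynamics (Equation 158): x <- x + c * score(x) + sqrt(2c) * noise
c, K = 0.01, 2000
x = rng.uniform(-4, 4, size=5000)           # x0 drawn from a uniform prior
for _ in range(K):
    x = x + c * score(x) + np.sqrt(2 * c) * rng.standard_normal(x.shape)

# Samples should hover near the two modes at -2 and +2
print(bool(np.mean(np.abs(np.abs(x) - 2.0) < 1.5) > 0.95))
```

Rerunning without the noise term collapses each chain deterministically onto whichever mode its initialization basin points to, illustrating the role of the stochastic term described above.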

Note that the objective in Equation 157 relies on having access to the ground truth score function, which is unavailable to us for complex distributions such as the one modeling natural images. Fortunately, alternative techniques known as score matching [14,15,16,17] have been derived to minimize this Fisher divergence without knowing the ground truth score, and can be optimized with stochastic gradient descent.

注意,等式157中的目标依赖于获得真实得分函数,而对于如自然图像这类复杂分布,我们无法获得真实得分函数。幸运的是,已有称为得分匹配(score matching)[14,15,16,17]的替代技术被提出,用以在不知道真实得分的情况下最小化费舍尔散度,并可通过随机梯度下降进行优化。

Collectively, learning to represent a distribution as a score function and using it to generate samples through Markov Chain Monte Carlo techniques, such as Langevin dynamics, is known as Score-based Generative Modeling [9, 10, 11].

总体而言,将分布表示为得分函数并通过马尔可夫链蒙特卡洛(Markov Chain Monte Carlo)技术如朗之万动力学生成样本的过程,被称为基于得分的生成建模(Score-based Generative Modeling)[9, 10, 11]。

There are three main problems with vanilla score matching, as detailed by Song and Ermon [9]. Firstly, the score function is ill-defined when x lies on a low-dimensional manifold in a high-dimensional space. This can be seen mathematically; all points not on the low-dimensional manifold would have probability zero, the log of which is undefined. This is particularly inconvenient when trying to learn a generative model over natural images, which is known to lie on a low-dimensional manifold of the entire ambient space.

正如Song和Ermon[9]详细指出的,原始得分匹配存在三个主要问题。首先,当 x 位于高维空间中的低维流形上时,得分函数定义不良。从数学上看,所有不在低维流形上的点概率为零,其对数未定义。这在尝试学习自然图像的生成模型时尤其不便,因为自然图像已知位于整个环境空间的低维流形上。

Secondly, the estimated score function trained via vanilla score matching will not be accurate in low density regions. This is evident from the objective we minimize in Equation 157. Because it is an expectation over p(x) ,and explicitly trained on samples from it,the model will not receive an accurate learning signal for rarely seen or unseen examples. This is problematic, since our sampling strategy involves starting from a random location in the high-dimensional space, which is most likely random noise, and moving according to the learned score function. Since we are following a noisy or inaccurate score estimate, the final generated samples may be suboptimal as well, or require many more iterations to converge on an accurate output.

其次,通过普通得分匹配训练的估计得分函数在低密度区域不会准确。这一点从我们在方程157中最小化的目标可以看出。因为该目标是对p(x)的期望,并且明确地在其样本上训练,模型不会对罕见或未见过的样本获得准确的学习信号。这是个问题,因为我们的采样策略涉及从高维空间的随机位置开始,这很可能是随机噪声,然后根据学习到的得分函数移动。由于我们遵循的是带噪声或不准确的得分估计,最终生成的样本可能也不理想,或者需要更多迭代才能收敛到准确的输出。

Lastly, Langevin dynamics sampling may not mix, even if it is performed using the ground truth scores. Suppose that the true data distribution is a mixture of two disjoint distributions:

最后,即使使用真实得分进行采样,Langevin动力学采样也可能无法混合。假设真实数据分布是两个不相交分布的混合:

(159) p(x) = c_1 p_1(x) + c_2 p_2(x)

Then, when the score is computed, these mixing coefficients are lost, since the log operation splits the coefficient from the distribution and the gradient operation zeros it out. To visualize this, note that the ground truth score function shown in the right Figure 6 is agnostic of the different weights between the three distributions; Langevin dynamics sampling from the depicted initialization point has a roughly equal chance of arriving at each mode, despite the bottom right mode having a higher weight in the actual Mixture of Gaussians.

那么,当计算得分时,这些混合系数会丢失,因为对数操作将系数与分布分开,梯度操作将其置零。为了形象化这一点,注意右侧图6中显示的真实得分函数对三个分布之间不同权重是无感知的;从所示初始化点进行的Langevin动力学采样大致有相等的概率到达每个模态,尽管右下角的模态在实际的高斯混合中权重更高。

It turns out that these three drawbacks can be simultaneously addressed by adding multiple levels of Gaussian noise to the data. Firstly, as the support of a Gaussian noise distribution is the entire space, a perturbed data sample will no longer be confined to a low-dimensional manifold. Secondly, adding large Gaussian noise will increase the area each mode covers in the data distribution, adding more training signal in low density regions. Lastly, adding multiple levels of Gaussian noise with increasing variance will result in intermediate distributions that respect the ground truth mixing coefficients. Formally,we can choose a positive sequence of noise levels {σt}t=1T and define a sequence of progressively perturbed data distributions:

事实证明,这三个缺点可以通过向数据添加多个层次的高斯噪声同时解决。首先,由于高斯噪声分布的支持是整个空间,扰动后的数据样本将不再局限于低维流形。其次,添加较大高斯噪声将增加数据分布中每个模态覆盖的区域,在低密度区域增加更多训练信号。最后,添加多个方差递增的高斯噪声层次将产生尊重真实混合系数的中间分布。形式上,我们可以选择一个正的噪声水平序列{σt}t=1T,并定义一系列逐步扰动的数据分布:

(160) p_{\sigma_t}(x_t) = \int p(x)\, \mathcal{N}(x_t; x, \sigma_t^2\mathbf{I})\, dx

Then,a neural network sθ(x,t) is learned using score matching to learn the score function for all noise levels simultaneously:

然后,使用得分匹配学习一个神经网络sθ(x,t),以同时学习所有噪声水平的得分函数:

(161) \underset{\theta}{\arg\min}\; \sum_{t=1}^{T}\lambda(t)\, \mathbb{E}_{p_{\sigma_t}(x_t)}\left[\left\|s_\theta(x_t, t) - \nabla\log p_{\sigma_t}(x_t)\right\|_2^2\right]

where λ(t) is a positive weighting function that conditions on noise level t . Note that this objective almost exactly matches the objective derived in Equation 148 to train a Variational Diffusion Model. Furthermore, the authors propose annealed Langevin dynamics sampling as a generative procedure, in which samples are produced by running Langevin dynamics for each t=T,T1,,2,1 in sequence. The initialization is chosen from some fixed prior (such as uniform), and each subsequent sampling step starts from the final samples of the previous simulation. Because the noise levels steadily decrease over timesteps t ,and we reduce the step size over time, the samples eventually converge into a true mode. This is directly analogous to the sampling procedure performed in the Markovian HVAE interpretation of a Variational Diffusion Model, where a randomly initialized data vector is iteratively refined over decreasing noise levels.

其中λ(t)是一个正的加权函数,条件于噪声水平t。注意,该目标几乎完全匹配方程148中推导的用于训练变分扩散模型(Variational Diffusion Model)的目标。此外,作者提出了退火Langevin动力学采样作为生成过程,其中样本通过依次运行每个t=T,T1,,2,1的Langevin动力学产生。初始化从某个固定先验(如均匀分布)中选择,每个后续采样步骤从前一次模拟的最终样本开始。由于噪声水平随时间步长t逐渐减小,且步长随时间减小,样本最终会收敛到真实模态。这与变分扩散模型的马尔可夫HVAE解释中的采样过程直接对应,其中随机初始化的数据向量在递减的噪声水平下迭代精炼。
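A sketch of annealed Langevin dynamics on the same kind of two-mode mixture: perturbing with \mathcal{N}(0, \sigma_t^2) simply widens each component, so the perturbed scores are available in closed form. The noise levels, step-size rule, and chain lengths below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Two-mode mixture; adding N(0, s^2) noise widens each component's variance
mus, sig0 = np.array([-2.0, 2.0]), 0.5
sigmas = np.array([2.0, 1.0, 0.5, 0.25, 0.1])   # hypothetical decreasing noise levels

def score(x, s):
    var = sig0**2 + s**2
    w = np.exp(-0.5 * (x[:, None] - mus) ** 2 / var)
    w = w / w.sum(axis=1, keepdims=True)
    return (w * (mus - x[:, None])).sum(axis=1) / var

# Annealed Langevin dynamics: a short chain at each noise level in turn,
# shrinking the step size with the noise level (a common heuristic).
x = rng.uniform(-4, 4, size=5000)
for s in sigmas:
    c = 0.05 * s**2
    for _ in range(100):
        x = x + c * score(x, s) + np.sqrt(2 * c) * rng.standard_normal(x.shape)

# Final samples concentrate around the true modes at -2 and +2
print(bool(np.mean(np.abs(np.abs(x) - 2.0) < 1.0) > 0.85))
```

Early, high-noise levels let chains mix across the smoothed landscape; later levels sharpen the samples onto the true modes, mirroring the coarse-to-fine refinement of the VDM sampling procedure.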

Therefore, we have established an explicit connection between Variational Diffusion Models and Score-based Generative Models, both in their training objectives and sampling procedures.

因此,我们已经在训练目标和采样过程上建立了变分扩散模型与基于得分的生成模型之间的明确联系。

One question is how to naturally generalize diffusion models to an infinite number of timesteps. Under the Markovian HVAE view, this can be interpreted as extending the number of hierarchies to infinity ($T \to \infty$). It is clearer to represent this from the equivalent score-based generative model perspective: under an infinite number of noise scales, the perturbation of an image over continuous time can be represented as a stochastic process, and therefore described by a stochastic differential equation (SDE). Sampling is then performed by reversing the SDE, which naturally requires estimating the score function at each continuous-valued noise level [10]. Different parameterizations of the SDE essentially describe different perturbation schemes over time, enabling flexible modeling of the noising procedure [6].

一个问题是如何自然地将扩散模型推广到无限多个时间步。在马尔可夫HVAE视角下,这可以解释为将层级数扩展到无穷大($T \to \infty$)。从等效的基于得分的生成模型视角来看更清晰:在无限多个噪声尺度下,图像随连续时间的扰动可以表示为一个随机过程,因此可以用随机微分方程(SDE)描述。采样则通过反转该SDE进行,这自然需要在每个连续值噪声水平估计得分函数[10]。SDE的不同参数化本质上描述了随时间变化的不同扰动方案,实现了对加噪过程的灵活建模[6]。
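As an illustration of this continuous-time view, the sketch below integrates the reverse of a variance-exploding SDE $dx = \sigma(t)\, dw$ with an Euler–Maruyama discretization. This particular SDE, the time grid, the prior initialization, and the callable `score_fn` are all illustrative assumptions; the structural point is only that reversing the SDE queries the score at continuous-valued times:

```python
import numpy as np

def reverse_sde_euler(score_fn, sigma_fn, shape, t1=1.0, t0=1e-3, n=500, rng=None):
    """Euler-Maruyama integration of the reverse-time SDE
        dx = -sigma(t)^2 * score(x, t) dt + sigma(t) d(w_bar),
    run backwards from t1 to t0. score_fn(x, t) must approximate
    grad log p_t(x) at every continuous time t."""
    rng = np.random.default_rng(rng)
    dt = (t1 - t0) / n                         # positive step taken backwards in time
    x = rng.standard_normal(shape) * sigma_fn(t1)  # rough wide prior at t1
    for i in range(n):
        t = t1 - i * dt
        g = sigma_fn(t)
        z = rng.standard_normal(shape)
        # reverse-time drift uses the score; diffusion re-injects noise
        x = x + (g ** 2) * score_fn(x, t) * dt + g * np.sqrt(dt) * z
    return x
```

With $\sigma(t) = t$ and data $p(x) = \mathcal{N}(0, 1)$, the perturbed marginal at time $t$ has variance $1 + t^3/3$, so the exact score $-x / (1 + t^3/3)$ can be plugged in to check that the sampler lands near the data distribution.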

Guidance

引导

So far, we have focused on modeling just the data distribution $p(x)$. However, we are often also interested in learning the conditional distribution $p(x \mid y)$, which would enable us to explicitly control the data we generate through conditioning information $y$. This forms the backbone of image super-resolution models such as Cascaded Diffusion Models [18], as well as state-of-the-art image-text models such as DALL-E 2 [19] and Imagen [7].

到目前为止,我们只关注了建模数据分布 $p(x)$。然而,我们通常也关心学习条件分布 $p(x \mid y)$,这使我们能够通过条件信息 $y$ 显式控制生成的数据。这构成了图像超分辨率模型如级联扩散模型(Cascaded Diffusion Models)[18]的核心,以及最先进的图文模型如DALL-E 2[19]和Imagen[7]。

A natural way to add conditioning information is simply alongside the timestep information, at each iteration. Recall our joint distribution from Equation 32:

一种自然的添加条件信息的方法是在每次迭代时与时间步信息一起添加。回顾我们在公式32中的联合分布:

$$p(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$

Then, to turn this into a conditional diffusion model, we can simply add arbitrary conditioning information y at each transition step as:

然后,为了将其转化为条件扩散模型,我们可以在每个转移步骤简单地添加任意条件信息y,形式为:

$$p(x_{0:T} \mid y) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, y) \tag{162}$$

For example, $y$ could be a text encoding in image-text generation, or a low-resolution image to perform super-resolution on. We are thus able to learn the core neural networks of a VDM as before, by predicting $\hat{x}_\theta(x_t, t, y) \approx x_0$, $\hat{\epsilon}_\theta(x_t, t, y) \approx \epsilon_0$, or $s_\theta(x_t, t, y) \approx \nabla \log p(x_t \mid y)$ for each desired interpretation and implementation. A caveat of this vanilla formulation is that a conditional diffusion model trained in this way may potentially learn to ignore or downplay any given conditioning information. Guidance is therefore proposed as a way to more explicitly control the amount of weight the model gives to the conditioning information, at the cost of sample diversity. The two most popular forms of guidance are known as Classifier Guidance [10, 20] and Classifier-Free Guidance [21].

例如,$y$ 可以是图文生成中的文本编码,或用于超分辨率的低分辨率图像。因此,我们能够像之前一样学习VDM(变分扩散模型,Variational Diffusion Model)的核心神经网络,针对所需的解释与实现,分别预测 $\hat{x}_\theta(x_t, t, y) \approx x_0$、$\hat{\epsilon}_\theta(x_t, t, y) \approx \epsilon_0$ 或 $s_\theta(x_t, t, y) \approx \nabla \log p(x_t \mid y)$。该基础形式的一个注意事项是,以这种方式训练的条件扩散模型可能会学会忽略或弱化任何给定的条件信息。因此,提出了引导(Guidance)作为一种更明确控制模型对条件信息赋权的方法,但代价是降低样本多样性。两种最流行的引导形式分别称为分类器引导(Classifier Guidance)[10, 20]和无分类器引导(Classifier-Free Guidance)[21]。
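The ancestral sampler implied by Equation 162 simply threads $y$ through every transition. A minimal sketch follows; the callable `denoise_step`, standing in for a sample from the learned $p_\theta(x_{t-1} \mid x_t, y)$, is a hypothetical placeholder:

```python
import numpy as np

def conditional_ancestral_sample(denoise_step, T, shape, y, rng=None):
    """Sample from Eq. 162: draw x_T from the standard Gaussian prior,
    then apply the learned conditional transition for t = T, ..., 1,
    passing the conditioning information y at every step."""
    rng = np.random.default_rng(rng)
    x = rng.standard_normal(shape)        # x_T ~ p(x_T)
    for t in range(T, 0, -1):
        x = denoise_step(x, t, y, rng)    # x_{t-1} ~ p_theta(x_{t-1} | x_t, y)
    return x
```

The same loop recovers the unconditional sampler when `denoise_step` ignores `y`.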

Classifier Guidance

分类器引导

Let us begin with the score-based formulation of a diffusion model, where our goal is to learn $\nabla \log p(x_t \mid y)$, the score of the conditional model, at arbitrary noise levels $t$. Recall that $\nabla$ is shorthand for $\nabla_{x_t}$ in the interest of brevity. By Bayes rule, we can derive the following equivalent form:

让我们从基于分数的扩散模型公式开始,我们的目标是在任意噪声水平 $t$ 下学习条件模型的分数函数 $\nabla \log p(x_t \mid y)$。回想一下,为简洁起见,$\nabla$ 是 $\nabla_{x_t}$ 的简写。根据贝叶斯定理,我们可以推导出以下等价形式:

$$\nabla \log p(x_t \mid y) = \nabla \log \left( \frac{p(x_t)\, p(y \mid x_t)}{p(y)} \right) \tag{163}$$

$$= \nabla \log p(x_t) + \nabla \log p(y \mid x_t) - \nabla \log p(y) \tag{164}$$

$$= \underbrace{\nabla \log p(x_t)}_{\text{unconditional score}} + \underbrace{\nabla \log p(y \mid x_t)}_{\text{adversarial gradient}} \tag{165}$$

where we have leveraged the fact that the gradient of $\log p(y)$ with respect to $x_t$ is zero.

其中我们利用了 $\log p(y)$ 关于 $x_t$ 的梯度为零的事实。

Our final derived result can be interpreted as learning an unconditional score function combined with the adversarial gradient of a classifier $p(y \mid x_t)$. Therefore, in Classifier Guidance [10, 20], the score of an unconditional diffusion model is learned as previously derived, alongside a classifier that takes in arbitrarily noisy $x_t$ and attempts to predict conditional information $y$. Then, during the sampling procedure, the overall conditional score function used for annealed Langevin dynamics is computed as the sum of the unconditional score function and the adversarial gradient of the noisy classifier.

我们最终推导的结果可以解释为学习一个无条件分数函数,并结合分类器 $p(y \mid x_t)$ 的对抗梯度。因此,在分类器引导[10, 20]中,无条件扩散模型的分数函数按之前推导的方式学习,同时训练一个分类器,该分类器接受任意噪声程度的 $x_t$ 并尝试预测条件信息 $y$。然后,在采样过程中,用于退火Langevin动力学的整体条件分数函数计算为无条件分数函数与噪声分类器对抗梯度之和。

In order to introduce fine-grained control to either encourage or discourage the model to consider the conditioning information, Classifier Guidance scales the adversarial gradient of the noisy classifier by a hyperparameter term $\gamma$. The score function learned under Classifier Guidance can then be summarized as:

为了引入细粒度控制以鼓励或抑制模型考虑条件信息,分类器引导通过一个超参数 $\gamma$ 对噪声分类器的对抗梯度进行缩放。在分类器引导下学习的分数函数可以总结为:

$$\nabla \log p(x_t \mid y) = \nabla \log p(x_t) + \gamma\, \nabla \log p(y \mid x_t) \tag{166}$$

Intuitively, when $\gamma = 0$ the conditional diffusion model learns to ignore the conditioning information entirely, and when $\gamma$ is large the conditional diffusion model learns to produce samples that heavily adhere to the conditioning information. This comes at the cost of sample diversity, as the model would only produce data from which the provided conditioning information can easily be regenerated, even at high noise levels.

直观地,当 $\gamma = 0$ 时,条件扩散模型学会完全忽略条件信息;当 $\gamma$ 较大时,条件扩散模型学会生成高度遵循条件信息的样本。这会以样本多样性为代价,因为模型只会生成那些即使在高噪声水平下也能轻易重现所提供条件信息的数据。
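A fully analytic toy example of Equation 166: for a 1-D two-class mixture $p(x \mid y) = \mathcal{N}(\mu_y, 1)$ with equal class priors, the unconditional score and the classifier gradient are both available in closed form, so the effect of $\gamma$ can be checked directly. The mixture, its means, and the numerical-gradient helper are illustrative assumptions, not part of the text:

```python
import numpy as np

MU = {0: -2.0, 1: 2.0}  # class-conditional means (illustrative)

def log_p_y_given_x(x, y):
    """Bayes rule on the mixture: log p(y | x) with equal class priors."""
    logits = np.array([-0.5 * (x - MU[k]) ** 2 for k in (0, 1)])
    return logits[y] - np.logaddexp(logits[0], logits[1])

def uncond_score(x):
    """Score of the marginal p(x) = 0.5 N(-2, 1) + 0.5 N(2, 1)."""
    w = np.exp([-0.5 * (x - MU[k]) ** 2 for k in (0, 1)])
    w = w / w.sum()
    return w[0] * (MU[0] - x) + w[1] * (MU[1] - x)

def classifier_grad(x, y, h=1e-5):
    """Central-difference gradient of log p(y | x) w.r.t. x."""
    return (log_p_y_given_x(x + h, y) - log_p_y_given_x(x - h, y)) / (2 * h)

def guided_score(x, y, gamma):
    """Eq. 166: unconditional score + gamma * adversarial classifier gradient."""
    return uncond_score(x) + gamma * classifier_grad(x, y)
```

At the midpoint $x = 0$ the unconditional score vanishes by symmetry, while the guided score for class $y = 1$ equals $2\gamma$: scaling $\gamma$ pushes the sample proportionally harder toward the conditioning class.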

One noted drawback of Classifier Guidance is its reliance on a separately learned classifier. Because the classifier must handle arbitrarily noisy inputs, which most existing pretrained classification models are not optimized to do, it must be learned ad hoc alongside the diffusion model.

分类器引导的一个显著缺点是依赖于单独训练的分类器。由于分类器必须处理任意噪声输入,而大多数现有的预训练分类模型并未针对这一点进行优化,因此必须与扩散模型一起专门训练。

Classifier-Free Guidance

无分类器引导

In Classifier-Free Guidance [21], the authors ditch the training of a separate classifier model in favor of an unconditional diffusion model and a conditional diffusion model. To derive the score function under Classifier-Free Guidance, we can first rearrange Equation 165 to show that:

在无分类器引导[21]中,作者放弃了单独训练分类器模型,转而使用无条件扩散模型和条件扩散模型。为了推导无分类器引导下的分数函数,我们可以先重新排列公式165,得到:

$$\nabla \log p(y \mid x_t) = \nabla \log p(x_t \mid y) - \nabla \log p(x_t) \tag{167}$$

Then, substituting this into Equation 166, we get:

然后,将其代入公式166,得到:

$$\nabla \log p(x_t \mid y) = \nabla \log p(x_t) + \gamma \left( \nabla \log p(x_t \mid y) - \nabla \log p(x_t) \right) \tag{168}$$

$$= \nabla \log p(x_t) + \gamma\, \nabla \log p(x_t \mid y) - \gamma\, \nabla \log p(x_t) \tag{169}$$

$$= \underbrace{\gamma\, \nabla \log p(x_t \mid y)}_{\text{conditional score}} + \underbrace{(1 - \gamma)\, \nabla \log p(x_t)}_{\text{unconditional score}} \tag{170}$$

Once again, γ is a term that controls how much our learned conditional model cares about the conditioning information. When γ=0 ,the learned conditional model completely ignores the conditioner and learns an unconditional diffusion model. When γ=1 ,the model explicitly learns the vanilla conditional distribution without guidance. When γ>1 ,the diffusion model not only prioritizes the conditional score function, but also moves in the direction away from the unconditional score function. In other words, it reduces the probability of generating samples that do not use conditioning information, in favor of the samples that explicitly do. This also has the effect of decreasing sample diversity at the cost of generating samples that accurately match the conditioning information.

再次强调,γ 是一个控制我们学习到的条件模型对条件信息关注程度的参数。当 γ=0 时,学习到的条件模型完全忽略条件器,变成无条件扩散模型。当 γ=1 时,模型明确学习无指导的基础条件分布。当 γ>1 时,扩散模型不仅优先考虑条件得分函数,还朝远离无条件得分函数的方向移动。换句话说,它降低了生成不使用条件信息样本的概率,偏向于明确利用条件信息的样本。这也导致样本多样性降低,但生成的样本更准确地匹配条件信息。

Because learning two separate diffusion models is expensive, we can learn both the conditional and unconditional diffusion models together as a singular conditional model; the unconditional diffusion model can be queried by replacing the conditioning information with fixed constant values, such as zeros. This is essentially performing random dropout on the conditioning information. Classifier-Free Guidance is elegant because it enables us greater control over our conditional generation procedure while requiring nothing beyond the training of a singular diffusion model.

由于学习两个独立的扩散模型代价高昂,我们可以将条件和无条件扩散模型作为单一条件模型一起学习;通过将条件信息替换为固定常数值(如零),即可查询无条件扩散模型。这本质上是在条件信息上执行随机丢弃。无分类器引导(Classifier-Free Guidance)方法优雅之处在于,它使我们能够更好地控制条件生成过程,同时只需训练一个扩散模型。
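Both halves of this recipe — conditioning dropout at training time and the Equation 170 combination at sampling time — fit in a few lines. The null token, the dropout rate, and the `model_score` callable are illustrative assumptions:

```python
import numpy as np

NULL_TOKEN = 0.0  # fixed constant standing in for "no conditioning"

def dropout_conditioning(y, p_uncond=0.1, rng=None):
    """Training-time trick: randomly replace y with the null token so a
    single model learns both the conditional and unconditional scores."""
    rng = np.random.default_rng(rng)
    return NULL_TOKEN if rng.random() < p_uncond else y

def cfg_score(model_score, x_t, t, y, gamma):
    """Sampling-time combination (Eq. 170):
    gamma * conditional score + (1 - gamma) * unconditional score,
    where the unconditional score is queried with the null token."""
    cond = model_score(x_t, t, y)
    uncond = model_score(x_t, t, NULL_TOKEN)
    return gamma * cond + (1.0 - gamma) * uncond
```

With $\gamma = 1$ this reduces to the plain conditional score; with $\gamma > 1$ it extrapolates past the conditional score, away from the unconditional one.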

Closing

结语

Allow us to recapitulate our findings over the course of our explorations. First, we derive Variational Diffusion Models as a special case of a Markovian Hierarchical Variational Autoencoder, where three key assumptions enable tractable computation and scalable optimization of the ELBO. We then prove that optimizing a VDM boils down to learning a neural network to predict one of three potential objectives: the original source image from any arbitrary noisification of it, the original source noise from any arbitrarily noisified image, or the score function of a noisified image at any arbitrary noise level. Then, we dive deeper into what it means to learn the score function, and connect it explicitly with the perspective of Score-based Generative Modeling. Lastly, we cover how to learn a conditional distribution using diffusion models.

让我们回顾一下探索过程中的发现。首先,我们将变分扩散模型(Variational Diffusion Models, VDM)推导为马尔可夫分层变分自编码器(Markovian Hierarchical Variational Autoencoder)的特例,其中三个关键假设使得ELBO的计算和优化变得可行且可扩展。接着,我们证明优化VDM归结为训练神经网络去预测三种潜在目标之一:任意噪声化图像的原始源图像、任意噪声化图像的原始源噪声,或任意噪声水平下噪声化图像的得分函数。然后,我们深入探讨学习得分函数的含义,并将其明确关联到基于得分的生成建模(Score-based Generative Modeling)视角。最后,我们介绍了如何使用扩散模型学习条件分布。

In summary, diffusion models have shown incredible capabilities as generative models; indeed, they power the current state-of-the-art models on text-conditioned image generation such as Imagen and DALL-E 2. Furthermore, the mathematics that enable these models are exceedingly elegant. However, there still remain a few drawbacks to consider:

总之,扩散模型作为生成模型展现了惊人的能力;事实上,它们驱动了当前文本条件图像生成的最先进模型,如Imagen和DALL-E 2。此外,支撑这些模型的数学原理极为优雅。然而,仍有一些缺点需要考虑:

As a final note, the success of diffusion models highlights the power of Hierarchical VAEs as a generative model. We have shown that when we generalize to infinite latent hierarchies, even if the encoder is trivial and the latent dimension is fixed and Markovian transitions are assumed, we are still able to learn powerful models of data. This suggests that further performance gains can be achieved in the case of general, deep HVAEs, where complex encoders and semantically meaningful latent spaces can be potentially learned.

最后,扩散模型的成功凸显了分层变分自编码器(Hierarchical VAEs)作为生成模型的强大能力。我们展示了当推广到无限潜层次时,即使编码器简单、潜变量维度固定且假设马尔可夫转移,依然能够学习到强大的数据模型。这表明,在一般的深层HVAE中,通过复杂编码器和语义丰富的潜空间,可能实现更进一步的性能提升。

Acknowledgments: I would like to acknowledge Josh Dillon, Yang Song, Durk Kingma, Ben Poole, Jonathan Ho, Yiding Jiang, Ting Chen, Jeremy Cohen, and Chen Sun for reviewing drafts of this work and providing many helpful edits and comments. Thanks so much!

References

致谢:感谢Josh Dillon、Yang Song、Durk Kingma、Ben Poole、Jonathan Ho、Yiding Jiang、Ting Chen、Jeremy Cohen和Chen Sun审阅本文稿并提供许多有益的修改和意见。非常感谢!

参考文献

[1] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[1] Diederik P Kingma 和 Max Welling. 自动编码变分贝叶斯(Auto-encoding variational bayes). arXiv预印本 arXiv:1312.6114, 2013.

[2] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. Advances in neural information processing systems, 29, 2016.

[2] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever 和 Max Welling. 通过逆自回归流改进变分推断(Improved variational inference with inverse autoregressive flow). 神经信息处理系统进展(Advances in neural information processing systems), 29, 2016.

[3] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. Advances in neural information processing systems, 29, 2016.

[3] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby 和 Ole Winther. 阶梯变分自编码器(Ladder variational autoencoders). 神经信息处理系统进展(Advances in neural information processing systems), 29, 2016.

[4] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256-2265. PMLR, 2015.

[4] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, 和 Surya Ganguli. 使用非平衡热力学的深度无监督学习。发表于国际机器学习大会,页码2256-2265。PMLR, 2015。

[5] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.

[5] Jonathan Ho, Ajay Jain, 和 Pieter Abbeel. 去噪扩散概率模型。神经信息处理系统进展, 33:6840-6851, 2020。

[6] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696-21707, 2021.

[6] Diederik Kingma, Tim Salimans, Ben Poole, 和 Jonathan Ho. 变分扩散模型。神经信息处理系统进展, 34:21696-21707, 2021。

[7] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.

[7] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, 等. 具备深度语言理解的逼真文本到图像扩散模型。arXiv预印本 arXiv:2205.11487, 2022。

[8] Bradley Efron. Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496): 1602-1614, 2011.

[8] Bradley Efron. Tweedie公式与选择偏差。《美国统计协会杂志》, 106(496): 1602-1614, 2011。

[9] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.

[9] Yang Song 和 Stefano Ermon. 通过估计数据分布梯度进行生成建模。神经信息处理系统进展, 32, 2019。

[10] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

[10] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, 和 Ben Poole. 通过随机微分方程的基于分数的生成建模。arXiv预印本 arXiv:2011.13456, 2020。

[11] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438-12448, 2020.

[11] Yang Song 和 Stefano Ermon. 基于分数的生成模型训练的改进技术。神经信息处理系统进展, 33:12438-12448, 2020。

[12] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006.

[12] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, 和 F Huang. 能量基学习教程。结构化数据预测, 1(0), 2006。

[13] Yang Song and Diederik P Kingma. How to train your energy-based models. arXiv preprint arXiv:2101.03288, 2021.

[13] Yang Song 和 Diederik P Kingma. 如何训练你的能量基模型。arXiv预印本 arXiv:2101.03288, 2021。

[14] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.

[14] Aapo Hyvärinen 和 Peter Dayan. 通过分数匹配估计非归一化统计模型。《机器学习研究杂志》, 6(4), 2005。

[15] Saeed Saremi, Arash Mehrjou, Bernhard Schölkopf, and Aapo Hyvärinen. Deep energy estimator networks. arXiv preprint arXiv:1805.08306, 2018.

[15] Saeed Saremi, Arash Mehrjou, Bernhard Schölkopf, 和 Aapo Hyvärinen. 深度能量估计网络。arXiv预印本 arXiv:1805.08306, 2018。

[16] Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pages 574-584. PMLR, 2020.

[16] Yang Song, Sahaj Garg, Jiaxin Shi, 和 Stefano Ermon. 切片分数匹配:一种可扩展的密度和分数估计方法。发表于人工智能不确定性会议,页码574-584。PMLR, 2020。

[17] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7): 1661-1674, 2011.

[17] Pascal Vincent. 分数匹配与去噪自编码器之间的联系。《神经计算》, 23(7): 1661-1674, 2011。

[18] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47), 2022.

[18] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, 和 Tim Salimans. 用于高保真图像生成的级联扩散模型。《机器学习研究杂志》, 23:47-1, 2022。

[19] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.

[19] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, 和 Mark Chen. 基于CLIP潜变量的分层文本条件图像生成。arXiv预印本 arXiv:2204.06125, 2022.

[20] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780-8794, 2021.

[20] Prafulla Dhariwal 和 Alexander Nichol. 扩散模型在图像合成上超越生成对抗网络(GANs)。《神经信息处理系统进展》(Advances in Neural Information Processing Systems), 34:8780-8794, 2021.

[21] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.

[21] Jonathan Ho 和 Tim Salimans. 无分类器扩散引导方法。在NeurIPS 2021深度生成模型及下游应用研讨会,2021年。